LLM Text Evaluation & Instruction Tuning Annotator
Worked on multiple projects focused on evaluating and improving large language models. My tasks included rating AI-generated responses for correctness, completeness, clarity, safety, and instruction-following, as well as writing and curating high-quality prompt-response pairs for supervised fine-tuning (SFT). I labeled text for question answering, explanations, summarization, and conversation-style outputs, and flagged harmful, biased, or low-quality responses as part of red-teaming and safety review. The projects involved thousands of tasks, strict adherence to detailed guidelines, and regular quality checks, in which I consistently maintained high agreement with gold-standard labels and met the platform's QA metrics.
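
For illustration, below is a minimal Python sketch of the kind of record this rating and SFT work produces, plus a simple agreement check against gold labels. The field names, rating scale, and numbers are hypothetical examples, not any specific platform's schema.

```python
# Minimal sketch (hypothetical field names and 1-5 scale) of an annotation record
# and an SFT example, plus a simple agreement check against gold-standard labels.
from dataclasses import dataclass, field


@dataclass
class ResponseRating:
    """One annotator's judgment of a model response, one score per dimension."""
    correctness: int
    completeness: int
    clarity: int
    safety: int
    instruction_following: int
    flags: list[str] = field(default_factory=list)  # e.g. ["harmful", "biased"]


@dataclass
class SFTExample:
    """A prompt-response pair written or curated for supervised fine-tuning."""
    prompt: str
    response: str
    task_type: str  # "qa", "explanation", "summarization", "conversation"


def agreement_rate(labels: list[int], gold: list[int]) -> float:
    """Share of items where the annotator's label matches the gold-standard label."""
    assert len(labels) == len(gold) and len(gold) > 0
    return sum(a == g for a, g in zip(labels, gold)) / len(gold)


if __name__ == "__main__":
    example = SFTExample(
        prompt="Explain why the sky appears blue.",
        response="Sunlight scatters off air molecules; shorter (blue) wavelengths scatter most...",
        task_type="explanation",
    )
    rating = ResponseRating(correctness=5, completeness=4, clarity=5, safety=5,
                            instruction_following=5)
    print(example.task_type, rating)
    # Agreement against a small batch of gold labels (illustrative numbers only).
    print(f"agreement: {agreement_rate([5, 4, 3, 5], [5, 4, 2, 5]):.2f}")
```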