LLM Response Evaluation and Preference Ranking
Participated in ongoing evaluation and annotation projects for large language models. Primary tasks included comparing outputs from multiple AI models, ranking responses on accuracy, helpfulness, instruction following, and adherence to safety guidelines, and writing detailed justifications for each preference. Completed hundreds of tasks spanning diverse prompts, including email drafting, data categorization, reasoning problems, and structured output validation; illustrative sketches of the ranking aggregation and validation checks involved follow below. Maintained high consistency and quality scores by adhering strictly to the provided rubrics and project guidelines. Contributed to improving model performance by supplying high-quality human preference feedback for reinforcement learning from human feedback (RLHF) workflows.
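To illustrate how pairwise preference annotations of this kind are typically aggregated, here is a minimal Python sketch. The record format, model names, and the win-rate scoring rule are hypothetical stand-ins for illustration, not the project's actual tooling.

```python
from collections import defaultdict

# Hypothetical annotation records: each pairwise task names two model
# outputs and the annotator's preferred side ("a", "b", or "tie").
comparisons = [
    {"model_a": "model-x", "model_b": "model-y", "preference": "a"},
    {"model_a": "model-x", "model_b": "model-y", "preference": "tie"},
    {"model_a": "model-y", "model_b": "model-x", "preference": "a"},
]

def win_rates(records):
    """Aggregate pairwise preferences into a per-model win rate.

    A win counts 1, a tie counts 0.5, and every appearance in a
    comparison counts toward the denominator.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for r in records:
        a, b = r["model_a"], r["model_b"]
        games[a] += 1
        games[b] += 1
        if r["preference"] == "a":
            wins[a] += 1.0
        elif r["preference"] == "b":
            wins[b] += 1.0
        else:  # tie: half credit to each side
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

print(win_rates(comparisons))
# {'model-x': 0.5, 'model-y': 0.5}
```

Aggregates like these feed the reward signal in RLHF pipelines, which is why per-annotator consistency against the rubric matters.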
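The structured output validation tasks can be sketched the same way: checking whether a model's response parses and satisfies a stated schema before it is ranked. The rubric below (a JSON object with a string "category" and a float "confidence" in [0, 1]) is a hypothetical example, not a rubric from the project.

```python
import json

# Hypothetical rubric: the prompt asked the model for a JSON object
# with a string "category" and a float "confidence" in [0, 1].
REQUIRED_FIELDS = {"category": str, "confidence": float}

def validate_structured_output(raw_response: str) -> list[str]:
    """Return a list of rubric violations; an empty list means a pass."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    issues = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            issues.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            issues.append(f"{field} should be {expected.__name__}")
    conf = data.get("confidence")
    if isinstance(conf, float) and not 0.0 <= conf <= 1.0:
        issues.append("confidence out of range [0, 1]")
    return issues

print(validate_structured_output('{"category": "billing", "confidence": 0.92}'))
# [] -> passes the structural checks
print(validate_structured_output('{"category": "billing"}'))
# ['missing field: confidence']
```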