AI Model Evaluation Contractor
As an AI Model Evaluation Contractor, I evaluated and rated large language model (LLM) outputs across six task categories using structured rubrics. I designed test prompts and scoring guidelines to assess model performance and reported failure modes to inform future fine-tuning. I collaborated closely with a team to maintain consistent, high inter-annotator agreement.

• Completed over 2,400 output evaluation tasks focused on instruction-following, factual accuracy, and code quality.
• Designed edge-case prompts and rubrics to systematically probe how models handle multi-step and ambiguous inputs.
• Documented more than 140 model failure cases as structured feedback for reinforcement learning from human feedback (RLHF) prioritization.
• Maintained inter-annotator agreement above 0.88 Cohen's kappa through joint reviews and shared guideline updates (see the sketch below).
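For context on the agreement metric above: the following is a minimal sketch of how Cohen's kappa can be computed between two annotators' ratings. It is not the project's actual tooling; the function name, the pass/fail labels, and the example ratings are illustrative assumptions.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), i.e. observed agreement
    corrected for the agreement expected by chance."""
    n = len(ratings_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement: sum over labels of the product of each annotator's
    # marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings from two annotators over the same batch of outputs.
annotator_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
annotator_b = ["pass", "fail", "pass", "fail", "fail", "pass"]
print(f"Cohen's kappa: {cohen_kappa(annotator_a, annotator_b):.2f}")
```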