LLM Training Data Evaluation Specialist
- Evaluated 5,000+ prompt-response pairs for frontier large language model training, focusing on improving model reasoning, factuality, and safety alignment.
- Performed comparative evaluations between model variants to identify behavioral improvements and regressions.
- Specialized in multi-turn conversation analysis, chain-of-thought reasoning assessment, and hallucination detection across business, technical, and general knowledge domains.
- Conducted red teaming exercises to identify vulnerabilities in AI safety guardrails, including testing for privacy violations, unauthorized data disclosure, and policy compliance failures.
- Created evaluation rubrics with 30-60+ objective criteria to measure response quality, helpfulness, harmlessness, and instruction-following accuracy.
- Contributed to RLHF training pipelines for production AI systems deployed to millions of users.