LLM Evaluation & RLHF Data Annotation
Conducted large-scale annotation and evaluation of language model outputs on Outlier.ai, part of the Scale AI ecosystem. Assessed responses for reasoning accuracy, hallucinations, instruction-following, and factual consistency. Applied structured evaluation rubrics to score and rank candidate outputs, producing high-quality preference data and written feedback used in RLHF pipelines to improve model alignment and dataset quality.
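The rubric-driven scoring and ranking described above can be sketched as follows. This is a minimal illustration only: the criterion names, weights, 1-5 scale, and function names are assumptions for the sketch, not Scale AI's or Outlier.ai's actual tooling.

```python
from dataclasses import dataclass

# Illustrative rubric: criterion -> weight (assumed values, not the real rubric).
RUBRIC = {
    "reasoning_accuracy": 0.4,
    "factual_consistency": 0.3,
    "instruction_following": 0.2,
    "no_hallucination": 0.1,
}

@dataclass
class Annotation:
    response_id: str
    scores: dict  # criterion -> score on an assumed 1-5 scale

    def weighted_score(self) -> float:
        # Weighted sum of per-criterion scores over the rubric.
        return sum(RUBRIC[c] * self.scores[c] for c in RUBRIC)

def rank_responses(annotations):
    # Highest weighted score first; stable sort preserves input order on ties.
    return sorted(annotations, key=lambda a: a.weighted_score(), reverse=True)

def to_preference_pair(ranked):
    # Best vs. worst response becomes a chosen/rejected pair for RLHF reward training.
    return {"chosen": ranked[0].response_id, "rejected": ranked[-1].response_id}

a = Annotation("resp_a", {"reasoning_accuracy": 5, "factual_consistency": 4,
                          "instruction_following": 5, "no_hallucination": 5})
b = Annotation("resp_b", {"reasoning_accuracy": 3, "factual_consistency": 2,
                          "instruction_following": 4, "no_hallucination": 3})
pair = to_preference_pair(rank_responses([a, b]))
print(pair)  # → {'chosen': 'resp_a', 'rejected': 'resp_b'}
```

Converting ranked annotations into chosen/rejected pairs is one common way rubric scores feed a reward model; real pipelines typically also track annotator agreement and rationale text.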