AI Response Text Evaluation & Rating — LLM Quality Assessment
Contributed to an AI training project focused on evaluating and rating large language model (LLM) outputs for quality, accuracy, and safety. Read AI-generated responses to a wide range of prompts and applied structured rubrics to score each output across multiple dimensions: factual correctness, clarity, relevance, coherence, tone, and instruction-following. Performed pairwise comparisons, reviewing two AI responses side by side and ranking which better satisfied the original prompt; this produces the preference annotations consumed by RLHF (Reinforcement Learning from Human Feedback) training. Flagged outputs containing misleading information, harmful content, bias, or formatting errors, and wrote an explanation justifying every rating decision. Maintained consistent judgment across high-volume annotation queues by following detailed project guidelines and escalating ambiguous cases through the correct channels.
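To make the structured-rubric workflow concrete, the sketch below models a single rating record in Python. This is a minimal illustration, not the project's actual schema: the class name, field names, dimension keys, and the 1-to-5 scale are all assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch only: class name, field names, dimension keys, and the
# 1-5 scale are assumptions, not the project's actual annotation schema.
DIMENSIONS = (
    "factual_correctness",
    "clarity",
    "relevance",
    "coherence",
    "tone",
    "instruction_following",
)

@dataclass
class RubricRating:
    prompt: str             # the original prompt shown to the model
    response: str           # the AI-generated output being rated
    scores: dict[str, int]  # one score per rubric dimension
    flags: list[str]        # e.g. "misleading", "harmful", "bias", "formatting"
    justification: str      # written explanation required for every rating

    def __post_init__(self) -> None:
        # Every dimension must be scored, and each score must sit on the 1-5 scale.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(missing)}")
        for dim, score in self.scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{dim} score {score} is outside the 1-5 scale")

# Example record with every dimension scored and a written justification.
rating = RubricRating(
    prompt="Explain photosynthesis to a 10-year-old.",
    response="Plants use sunlight to turn water and air into food...",
    scores={dim: 4 for dim in DIMENSIONS},
    flags=[],
    justification="Accurate and age-appropriate, though slightly repetitive.",
)
```

Validating completeness and range at construction time mirrors how annotation tools typically reject a submission until every rubric dimension is scored and justified.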
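The pairwise comparisons can be modeled the same way. The chosen/rejected pair is the conventional shape of RLHF preference data; the field names below are likewise illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names are assumptions. The chosen/rejected
# pair is the conventional shape of RLHF preference data.
@dataclass
class PreferencePair:
    prompt: str    # the original prompt both responses answered
    chosen: str    # the response ranked as better satisfying the prompt
    rejected: str  # the response ranked as worse
    rationale: str # written justification for the ranking

pair = PreferencePair(
    prompt="Summarize this article in two sentences.",
    chosen="A two-sentence summary that respects the length constraint...",
    rejected="A five-paragraph rewrite that ignores the length constraint...",
    rationale="The chosen response follows the two-sentence instruction.",
)
```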