Project Coffee - Generative AI Text Evaluation
- Evaluated the quality, safety, and coherence of text generated by Large Language Models (LLMs) to improve conversational AI capabilities.
- Conducted side-by-side (SXS) comparisons of AI responses, rating them on helpfulness, honesty, and harmlessness under the HHH framework.
- Identified and annotated hallucinations, factual errors, and logical inconsistencies in responses to complex queries.
- Authored high-quality "Golden Set" rewrites to train models on preferred prose styles and reasoning patterns.
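
The rating and annotation work above maps naturally onto one structured record per comparison. Below is a minimal, hypothetical sketch in Python of how such an SXS evaluation record could be modeled; every name here (IssueLabel, HHHScore, SxSRecord) and the 1-5 rating scale are illustrative assumptions, not the project's actual schema or tooling.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

# Hypothetical label vocabulary for annotated issues
# (illustrative only, not the project's actual taxonomy).
class IssueLabel(Enum):
    HALLUCINATION = "hallucination"
    FACTUAL_ERROR = "factual_error"
    LOGICAL_INCONSISTENCY = "logical_inconsistency"

@dataclass
class HHHScore:
    # Per-response ratings on the three HHH axes;
    # the 1-5 integer scale is an assumption.
    helpfulness: int
    honesty: int
    harmlessness: int

@dataclass
class SxSRecord:
    # One side-by-side comparison of two model responses to the same query.
    query: str
    response_a: str
    response_b: str
    score_a: HHHScore
    score_b: HHHScore
    preferred: str  # "A", "B", or "tie"
    issues_a: List[IssueLabel] = field(default_factory=list)
    issues_b: List[IssueLabel] = field(default_factory=list)
    golden_rewrite: Optional[str] = None  # rater-authored preferred response, if written

Keeping each comparison as a single record of this shape makes downstream analyses, such as label frequencies or inter-rater agreement, straightforward to compute.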