LLM Prompt Evaluation and Multi-turn AI Assistant Training
Improved large language model (LLM) performance through structured prompt-response evaluation. Reviewed multi-turn assistant dialogues in English and Spanish, scoring responses on clarity, helpfulness, tone, and factual accuracy, and tagging issues such as hallucinations, incomplete reasoning, tool misuse, and user-intent mismatch. Wrote detailed feedback and critic comments to guide model improvements, and evaluated prompts across domains including productivity, education, and general user queries. Sustained consistently high rating quality across thousands of evaluations while meeting rigorous deadlines and task quotas.