LLM Prompt-Response Evaluation – Outlier (2024-2025)
Contract evaluator reviewing 12,000+ bilingual (EN/ES) prompt-response pairs. Ranked alternative responses, tagged error types (factuality, style, bias), and drafted improved answers. Achieved ≥98% agreement with gold labels and helped raise the overall model acceptance rate by 15%.