Factual Accuracy Evaluation for Generative AI Responses
This project involved the critical evaluation and labeling of text responses generated by a large language model (LLM) for factual accuracy. The work required analyzing a diverse dataset of prompts and responses covering topics such as sports history, scientific concepts, and music trivia.

My specific labeling tasks included:

- Performing binary classification, tagging each response as "Accurate" or "Contains Inaccuracy."
- Conducting thorough fact-checking by cross-referencing every claim against reliable, verifiable sources.
- Identifying and categorizing specific error types, including conceptual inaccuracies (e.g., misidentifying a natural mineral as synthetic), factual data errors (e.g., attributing a solo artist's work to a band), and typographical errors in proper nouns.
- Strictly adhering to detailed client guidelines to ensure consistent, unbiased judgment across all data points.

The final deliverable was the fully labeled dataset, provided to a key client.
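To make the labeling scheme concrete, the record structure below is a minimal sketch of how one annotation could be represented and validated. The names (`Annotation`, `Label`, `ErrorType`) and the validation rules are illustrative assumptions, not the client's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Label(str, Enum):
    # Binary classification values, matching the task description.
    ACCURATE = "Accurate"
    CONTAINS_INACCURACY = "Contains Inaccuracy"

class ErrorType(str, Enum):
    # Hypothetical codes for the three error categories described above.
    CONCEPTUAL = "conceptual_inaccuracy"
    FACTUAL_DATA = "factual_data_error"
    TYPO_PROPER_NOUN = "typographical_error_in_proper_noun"

@dataclass
class Annotation:
    prompt: str
    response: str
    label: Label
    error_types: list = field(default_factory=list)
    sources: list = field(default_factory=list)  # references used for fact-checking

    def validate(self) -> None:
        # An "Accurate" label must not carry error categories, and any
        # flagged inaccuracy must name at least one error type.
        if self.label is Label.ACCURATE and self.error_types:
            raise ValueError("Accurate responses cannot list error types")
        if self.label is Label.CONTAINS_INACCURACY and not self.error_types:
            raise ValueError("Inaccurate responses need at least one error type")
```

A consistency check like `validate()` helps enforce the client guidelines programmatically, so label/category mismatches are caught before delivery rather than during review.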