Evaluation of AI-Generated Text Responses
I evaluated responses generated by Large Language Models (LLMs), rating them for accuracy, coherence, and relevance according to project guidelines. Tasks included verifying factual correctness, identifying hallucinations, assessing safety risks, and comparing multiple model outputs to select the best response. I applied annotation standards consistently to support improvements in model quality and behavior.