LLM Output Evaluation & Prompt-Response Annotation for Conversational AI
Evaluated and annotated AI-generated prompt-response pairs for quality, safety, coherence, and factual grounding, simulating real-world LLM training-data refinement. Tasks included ranking multiple model outputs, flagging hallucinations and toxic content, rewriting ambiguous prompts for clarity, and classifying responses by tone (helpful, neutral, or harmful).
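The workflow above can be sketched as a simple annotation schema plus a ranking rule. This is a hypothetical illustration, not the actual tooling used: the `Annotation` record, `Tone` labels, and the scoring heuristic (penalizing hallucinated or toxic outputs before sorting by quality) are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tone(Enum):
    HELPFUL = "helpful"
    NEUTRAL = "neutral"
    HARMFUL = "harmful"


@dataclass
class Annotation:
    """One annotated prompt-response pair (hypothetical schema)."""
    prompt: str
    response: str
    quality: int              # rubric score, e.g. 1-5
    hallucination: bool       # flagged as factually ungrounded
    toxic: bool               # flagged as harmful/toxic content
    tone: Tone
    rewritten_prompt: Optional[str] = None  # clarified prompt, if ambiguous


def rank_responses(annotations: List[Annotation]) -> List[Annotation]:
    """Rank candidate responses for one prompt: safety flags act as
    score penalties, then candidates sort by descending quality."""
    def score(a: Annotation) -> int:
        penalty = (2 if a.hallucination else 0) + (3 if a.toxic else 0)
        return a.quality - penalty
    return sorted(annotations, key=score, reverse=True)


# Usage: two candidate outputs for the same prompt; the hallucinated
# one drops below the grounded one despite a similar quality score.
grounded = Annotation("What is 2+2?", "4", 5, False, False, Tone.HELPFUL)
hallucinated = Annotation("What is 2+2?", "5, per NASA", 4, True, False, Tone.NEUTRAL)
ranked = rank_responses([hallucinated, grounded])
print([a.response for a in ranked])  # grounded response ranks first
```

The penalty weights here are arbitrary; in practice a rubric or pairwise-preference interface would define the ordering.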