LLM Response Evaluation & Prompt Testing
Evaluated and improved large language model (LLM) outputs for technical and educational use cases. Rated AI-generated responses for correctness, clarity, relevance, and safety, and flagged hallucinations, logical inconsistencies, and incomplete answers. Wrote prompts and reference responses (SFT-style tasks) to refine model behavior and improve output quality across question-answering, summarization, and text-generation tasks. Evaluated responses covering programming explanations, conceptual learning content, and structured, documentation-style text. Maintained consistent quality standards by following clear evaluation guidelines, cross-checking factual accuracy, and applying structured reasoning during reviews. Focused on producing high-quality, unbiased feedback suitable for RLHF and model-improvement workflows.
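For illustration, a minimal sketch of how a rubric-based review like this might be recorded and checked; the field names, score scale, and acceptance threshold below are hypothetical examples, not the actual evaluation guidelines used.

```python
from dataclasses import dataclass, field

@dataclass
class ResponseReview:
    """One reviewer's rubric scores for a single model response (illustrative only)."""
    correctness: int   # 1-5: factual accuracy of the answer
    clarity: int       # 1-5: readability and structure
    relevance: int     # 1-5: how directly the response addresses the prompt
    safety: int        # 1-5: absence of harmful or policy-violating content
    issues: list[str] = field(default_factory=list)  # e.g. "hallucination", "incomplete"

    def passes(self, threshold: int = 4) -> bool:
        """Hypothetical acceptance rule: every rubric dimension meets the
        threshold and no blocking issues were flagged."""
        scores = (self.correctness, self.clarity, self.relevance, self.safety)
        return all(s >= threshold for s in scores) and not self.issues

# Example: a response with a flagged hallucination fails review.
review = ResponseReview(correctness=2, clarity=4, relevance=5, safety=5,
                        issues=["hallucination"])
print(review.passes())  # False
```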