LLM Coding Evaluation & Python Data Labeling Specialist
- Worked on LLM evaluation and human-feedback projects focused on code generation and reasoning tasks, primarily in Python.
- Rated model-generated responses across structured dimensions including Instruction Following, Truthfulness, Style & Clarity, Verbosity, and Overall Quality.
- Compared paired responses, assigned preference rankings, and wrote detailed justifications explaining the reasoning and trade-offs behind each ranking.
- Ensured evaluations strictly adhered to the system prompt, user prompt, and conversation-history context.
- Identified hallucinations, logical inconsistencies, instruction violations, and core-requirement failures in technical responses.
- Contributed to reinforcement learning from human feedback (RLHF) pipelines aimed at improving code reliability and reasoning alignment.
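
Illustrative only: a minimal Python sketch of the kind of pairwise labeling record this work produced, assuming hypothetical field and class names (PairwiseLabel, ResponseRating, Preference); actual annotation schemas, scales, and tooling varied by project.

from dataclasses import dataclass
from enum import Enum


class Preference(Enum):
    # Which of two candidate responses is preferred (labels are hypothetical).
    A_MUCH_BETTER = 2
    A_BETTER = 1
    TIE = 0
    B_BETTER = -1
    B_MUCH_BETTER = -2


@dataclass
class ResponseRating:
    # Per-response scores on the structured dimensions listed above.
    # A 1-5 scale is assumed here purely for illustration.
    instruction_following: int
    truthfulness: int
    style_and_clarity: int
    verbosity: int
    overall_quality: int


@dataclass
class PairwiseLabel:
    # One labeled comparison of two model responses to the same prompt.
    prompt_id: str
    rating_a: ResponseRating
    rating_b: ResponseRating
    preference: Preference
    justification: str  # free-text explanation of the ranking and trade-offs


# Example record (all values invented for illustration).
label = PairwiseLabel(
    prompt_id="example-001",
    rating_a=ResponseRating(5, 5, 4, 3, 5),
    rating_b=ResponseRating(3, 4, 4, 2, 3),
    preference=Preference.A_BETTER,
    justification="Response A follows the system prompt and handles the edge case; "
                  "Response B omits input validation required by the user prompt.",
)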