Evaluating Instruction Following Capabilities of Large Language Models on Structured & Unstructured Tasks
Evaluated large language model (LLM) outputs against benchmark text datasets to assess instruction adherence, response correctness, and overall quality. Applied structured rating rubrics and quantitative metrics, including accuracy and F1-score, to compare model performance across prompt types. Built Python-based validation workflows for preprocessing, scoring, and error analysis, keeping evaluations consistent and repeatable; a scoring sketch follows below.
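A minimal sketch of what such a scoring step might look like, assuming classification-style prompts with gold labels and the scikit-learn metrics API; the function names (normalize_label, score_run) and the example data are illustrative, not taken from the actual project.

# Illustrative sketch: scoring LLM label predictions against gold labels
# with accuracy and macro F1 via scikit-learn. Names and data are hypothetical.
from sklearn.metrics import accuracy_score, f1_score

def normalize_label(text: str) -> str:
    """Light preprocessing: strip whitespace/trailing punctuation, lowercase."""
    return text.strip().strip(".").lower()

def score_run(model_outputs: list[str], gold_labels: list[str]) -> dict:
    """Compare normalized model outputs to gold labels for one prompt type."""
    preds = [normalize_label(o) for o in model_outputs]
    gold = [normalize_label(g) for g in gold_labels]
    return {
        "accuracy": accuracy_score(gold, preds),
        "macro_f1": f1_score(gold, preds, average="macro", zero_division=0),
    }

if __name__ == "__main__":
    outputs = ["Positive.", "negative", "Neutral", "positive"]
    gold = ["positive", "negative", "positive", "positive"]
    print(score_run(outputs, gold))  # per-run accuracy and macro F1

Examples where the metric scores and rubric ratings disagree can then be routed to a separate error-analysis pass.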