LLM Evaluation/Preference Ranking Rater
Performed evaluation and preference ranking of language model outputs for science and math education content. Compared AI-generated text against rubrics for logical accuracy, domain fidelity, and academic rigor, providing human feedback to guide model selection and improve response quality.
• Used comparative (preference) ratings to improve model performance on scientific tasks.
• Assessed clarity, correctness, and completeness of AI-generated solutions.
• Evaluated and scored LLM outputs on physics and math questions.
• Documented evaluation results to support ongoing model improvement cycles.