AI Model Evaluator & Red Teamer
• Evaluated large language model (LLM) outputs for correctness, logical reasoning, and consistency across structured benchmark tasks.
• Designed and applied evaluation rubrics to assess model behavior; analyzed error patterns across diverse mathematical and computational problem sets.
• Performed adversarial prompt engineering to stress-test LLMs, informing alignment and robustness improvements.
• Contributed to reinforcement learning from human feedback (RLHF) and alignment workflows.
• Used proprietary internal tools for annotation and evaluation tasks.
• Specialized in evaluating text-based and mathematical responses.
• Collaborated remotely and documented findings to inform model improvement.