AI Research Associate, Scale AI
Evaluated AI model performance and conducted structured error analysis on outputs from multiple LLMs, generating actionable feedback that improved prompt design and informed model tuning.
• Evaluated model outputs for RLHF workflows using custom quality metrics
• Documented recurring error patterns to guide prompt-engineering improvements
• Built Python pipelines that automated model evaluation, reducing manual validation effort and standardizing output comparisons across GPT-4, Claude, and Mistral
• Delivered structured analysis that informed model tuning and improved output consistency