Delivery Data Scientist
Developed and refined 30+ SAT-grade evaluation responses and improved reasoning consistency in LLMs. Evaluated 100+ LLM outputs to identify error patterns and strengthen internal benchmarking. Assessed multimodal text-image datasets for annotation quality, label accuracy, and evaluation alignment.