LLM Evaluation and Data Labeling Lead
Oversaw benchmarking and evaluation of multiple large language models to optimize performance and reasoning quality. Designed and implemented prompt engineering frameworks and evaluation loops to improve the consistency and reliability of LLM outputs. Led efforts to strengthen data labeling protocols for hallucination tracking and metric evaluation.
• Evaluated LLM outputs across structured and unstructured datasets.
• Designed prompt engineering processes to improve label quality.
• Built observability and evaluation systems to monitor labeling success and hallucination rates.
• Mentored the team on LLM evaluation and production-readiness best practices.