AI Engineer & Data Scientist – LLM Output Evaluation and Human-in-the-loop Feedback
I evaluated and ranked outputs from large language models (LLMs) such as GPT-4o and Llama against reasoning, factual-accuracy, and safety criteria. I designed and implemented large-scale LLM evaluation pipelines that combined structured workflows with human-in-the-loop feedback for iterative model improvement, contributing to greater model reliability, lower hallucination rates, and more efficient benchmarking.
• Implemented evaluation frameworks to measure and improve response quality (see the sketch below).
• Collaborated on RLHF-style feedback systems to refine model alignment.
• Applied prompt engineering to reduce targeted error classes in model outputs.
• Used benchmarking metrics to improve evaluation efficiency across multiple domains.
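The sketch below is a minimal, self-contained Python illustration of the ranking-plus-feedback loop described above. The criterion names, 1–5 rubric scale, and weight-update rule are illustrative assumptions, not the actual production pipeline.

```python
from dataclasses import dataclass, field

# Illustrative criteria; real rubrics and weights are assumptions here.
CRITERIA = ("reasoning", "factual_accuracy", "safety")

@dataclass
class Candidate:
    model: str                                   # e.g. "gpt-4o" or "llama"
    text: str
    scores: dict = field(default_factory=dict)   # criterion -> 1..5 rubric score

def weighted_score(cand: Candidate, weights: dict) -> float:
    """Combine per-criterion rubric scores into one ranking score."""
    return sum(weights[c] * cand.scores.get(c, 0) for c in CRITERIA)

def rank_candidates(candidates: list, weights: dict) -> list:
    """Rank model outputs best-first by weighted rubric score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)

def apply_human_feedback(weights: dict, preferred: Candidate,
                         rejected: Candidate, lr: float = 0.05) -> dict:
    """Human-in-the-loop step: when an annotator prefers a lower-ranked output,
    nudge the weights toward the criteria where the preferred output wins."""
    for c in CRITERIA:
        delta = preferred.scores.get(c, 0) - rejected.scores.get(c, 0)
        weights[c] = max(0.0, weights[c] + lr * delta)
    total = sum(weights.values()) or 1.0
    return {c: w / total for c, w in weights.items()}  # renormalize to sum to 1

if __name__ == "__main__":
    weights = {c: 1 / len(CRITERIA) for c in CRITERIA}
    a = Candidate("gpt-4o", "...", {"reasoning": 5, "factual_accuracy": 4, "safety": 5})
    b = Candidate("llama", "...", {"reasoning": 4, "factual_accuracy": 5, "safety": 4})
    ranked = rank_candidates([a, b], weights)
    # Annotator disagrees with the top ranking: fold that preference back in.
    weights = apply_human_feedback(weights, preferred=b, rejected=a)
    print([c.model for c in rank_candidates([a, b], weights)], weights)
```

In a production setting the rubric scores would come from trained annotators or an LLM judge, and the preference pairs would feed an RLHF reward model rather than this simple weight adjustment.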