Software Development Engineer in Test (LLMs)
In this role, I evaluated and ranked large language model (LLM) outputs across a variety of subject areas. I contributed structured annotations, generated datasets, and authored rationales to improve supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) pipelines. My work ensured high dataset quality, factual accuracy, and ethical alignment in AI systems.
• Benchmarked model outputs against human baselines using Python-based Evals.
• Authored rationales for evaluation decisions to improve AI model alignment.
• Created prompt sets and datasets targeting conversational and reasoning improvements.
• Collaborated with researchers to calibrate reward models via structured feedback.