Senior Software Engineer — AI Search & Evaluation
Reviewed and graded LLM outputs from models including GPT-4, Claude, and Cohere for retrieval and ranking quality. Designed and maintained task-specific evaluation rubrics measuring factual accuracy, relevance, and grounding in AI outputs. Authored evidence-based rationales identifying hallucinations, edge cases, and other model failures.
• Graded outputs against detailed rubric criteria and user-simulation runs.
• Calibrated evaluation tasks for production datasets in AI pipelines.
• Applied systematic, rubric-driven scoring and feedback to model responses.
• Used proprietary internal tooling for annotation and assessment.