AI Evaluation/LLM Benchmarking Engineer
As a Senior Software Engineer, I evaluated AI coding agents and implemented quality-control protocols for LLM benchmarking. My work focused on multilingual data, prompt engineering, and rigorous human and automated audits, assessing agent outputs for structural failures and edge cases in non-English environments.
• Designed Terminal-Bench suites to challenge LLMs in multilingual contexts.
• Built task environments with native-language datasets and realistic constraints.
• Conducted iterative audits combining human review and LLM-based checks.
• Ensured quality through multilayered evaluation and calibration processes.