LLM Evaluation Specialist & Code Interpreter Auditor (Freelance), Revelo
At Revelo, I audited LLM-generated Python tasks and designed structured multi-turn coding evaluations to assess agent reasoning. I verified unit tests, reference solutions, and evaluation rubrics for robustness and logical consistency, benchmarked model performance to support alignment improvements, and identified systematic coding gaps.

• Audited 30+ Python tasks in Docker sandbox environments
• Designed multi-turn coding evaluations and assessed agent reasoning
• Benchmarked GPT-4 and Claude outputs for code alignment
• Identified critical bugs, hardcoded solutions, and reproducibility issues