
Anthony Philip

AI Evaluation/LLM Benchmarking Engineer

Toronto, Kenya
$20.00/hr · Expert · Appen · CloudFactory · Data Annotation Tech

Key Skills

Software

Appen
CloudFactory
Data Annotation Tech
Mercor
Micro1

Top Subject Matter

Multilingual AI Evaluation
LLM Benchmarking
Legal Services & Contract Review

Top Data Types

Text
Document
Computer Code Programming

Top Task Types

Text Generation
Transcription
Evaluation/Rating
Data Collection
Prompt + Response Writing (SFT)
Classification

Freelancer Overview

AI Evaluation/LLM Benchmarking Engineer with 9+ years of professional experience spanning complex professional workflows, research, and quality-focused execution. Core strengths include internal and proprietary tooling. Education: Bachelor of Science, Massachusetts Institute of Technology (MIT), 2018; Master of Science, Stanford University, 2020. AI-training focus includes data types such as text and labeling workflows such as evaluation and rating.

English: Expert

Labeling Experience

AI Evaluation/LLM Benchmarking Engineer

Text
As a Senior Software Engineer, I evaluated AI coding agents and implemented quality control protocols for LLM benchmarking. My work focused on multilingual data, prompt engineering, and rigorous human and automated audits. I assessed agent outputs for structural failures and edge cases in non-English environments.
• Designed Terminal-Bench suites to challenge LLMs in multilingual contexts.
• Built task environments with native-language datasets and realistic constraints.
• Conducted iterative audits involving human review and LLM-based checks.
• Ensured quality through multilayered evaluation and calibration processes.


2021 - Present

Education


Stanford University

Master of Science, Computer Science

2018 - 2020

Massachusetts Institute of Technology (MIT)

Bachelor of Science, Computer Science

2014 - 2018

Work History


Applied AI & Multilingual Systems

Senior Software Engineer

Toronto
2021 - Present

Systems Infrastructure & Localization

Software Engineer

Toronto
2018 - 2020