Bob Curtis

Senior AI Evaluation Specialist - Machine Learning Systems

California, USA
$35.00/hr · Expert · Scale AI

Key Skills

Software

Scale AI

Top Subject Matter

No subject matter listed

Top Data Types

Text

Top Label Types

Classification
RLHF
Evaluation Rating
Red Teaming
Prompt-Response Writing (SFT)

Freelancer Overview

With over 8 years of experience in AI systems evaluation, quality assurance, and model benchmarking, I specialize in designing and executing rigorous data labeling and annotation frameworks for complex AI and machine learning systems. My background includes hands-on work with NLP and large language models at leading tech companies, where I developed structured validation protocols, identified critical edge cases, and established gold-standard benchmarks for agent reasoning and decision-making. I am highly skilled in scenario design, logical consistency review, and structured data evaluation using Python and SQL, working with formats such as JSON and YAML. My work has directly informed improvements in system reliability and interpretability, and I am passionate about delivering high-quality, precise training data that drives the development of robust AI solutions.
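The structured-output validation described above can be illustrated with a minimal, hedged Python sketch; this is not code from any actual engagement, and the schema and field names are hypothetical:

import json

# Hypothetical schema for an annotated prompt-response record:
# field name -> Python type the value must have.
EXPECTED_SCHEMA = {
    "prompt": str,
    "response": str,
    "rating": int,
    "labels": list,
}

def validate_record(raw):
    """Return a list of schema violations for one JSON record (empty list = valid)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: {}".format(exc)]
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append("missing field: " + field)
        elif not isinstance(record[field], expected_type):
            errors.append("wrong type for field: " + field)
    # Surface unexpected extra fields so drifting model outputs are caught early.
    errors.extend("unexpected field: " + f for f in record if f not in EXPECTED_SCHEMA)
    return errors

print(validate_record('{"prompt": "2+2?", "response": "4", "rating": 5, "labels": []}'))  # []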

English (Expert)

Labeling Experience

Scale AI

Large Language Model (LLM) Response Evaluation & RLHF Annotation

Scale AI · Text · Classification · RLHF
Contributed to large-scale LLM training and evaluation initiatives focused on supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Annotated and evaluated thousands of prompt-response pairs for quality, safety, reasoning accuracy, factual correctness, and policy compliance. Key responsibilities included:

Ranking multiple model outputs by coherence, helpfulness, and alignment with guidelines
Labeling reasoning errors, hallucinations, and logical inconsistencies
Performing intent classification and sentiment tagging
Conducting red teaming to identify safety vulnerabilities and edge cases
Writing high-quality prompt-response examples for supervised fine-tuning
Validating structured outputs (JSON/YAML) for schema adherence and completeness

Worked with detailed annotation rubrics to maintain high inter-annotator agreement (IAA) and consistency, and participated in calibration sessions and secondary review workflows to ensure label quality (see the IAA sketch below).
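For context on the inter-annotator agreement (IAA) metric named above, a minimal sketch of Cohen's kappa for two annotators over categorical labels; the label data is invented purely for illustration:

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators tagging responses as safe/unsafe.
a = ["safe", "safe", "unsafe", "safe", "unsafe"]
b = ["safe", "unsafe", "unsafe", "safe", "unsafe"]
print(round(cohens_kappa(a, b), 3))  # 0.615, i.e. substantial agreement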

2021 - 2025

Education

Stanford University

Doctor of Philosophy, Computer Science, Machine Learning Evaluation

2012 - 2016

University of California, Berkeley

Master of Science, Data Science

2012 - 2013

Work History

Google

Senior AI Evaluation Specialist

California
2021 - Present

Tesla

Machine Learning Systems Analyst and QA Lead

Palo Alto
2018 - 2021