We’re looking for an analytical scenario writer with strong QA-style thinking and excellent written English. You should be comfortable designing structured evaluation scenarios, defining expected (“gold standard”) agent behavior, and working with structured formats like JSON/YAML. A background in software testing, QA, data analysis, or NLP annotation is strongly preferred. Basic Python and JavaScript experience is required.

What you’ll be doing:

You’ll design realistic, reusable evaluation scenarios for LLM-based agents that simulate real-world tasks. You’ll define the golden path and acceptable behaviors, annotate task steps and expected outputs, and document edge cases and scoring logic. You’ll also review agent outputs, iterate on scenarios for clarity and coverage, and collaborate with developers and other contributors to test and refine evaluation frameworks.
Total Budget
$4,800
Pay per Label
$24/hr
Time Requirement
20+ hrs/week
Duration
6+ months
Agent evaluation scenarios and test cases
Software
Hiring Type
Required Location
Workload / Schedule
Flexible; can start immediately
Data Type
Label Types
Subject Matter / Industry
Language
Job Type