AI Evaluation Specialist — Outlier.ai, OpenClaw Atlas Project
As an AI Evaluation Specialist at Outlier.ai, I designed, executed, and documented complex multi-model AI evaluations. My work included structured dataset creation, scenario development, and benchmarking model responses for fairness and performance. I developed failure taxonomies and managed evaluation runs across multiple universes and domains.
• Crafted and used synthetic text datasets to test model reasoning and assess time-domain consistency.
• Benchmarked single-turn model responses by issuing identical prompts to multiple LLMs.
• Developed taxonomies for categorizing and flagging failure modes such as privacy violations and defamation risk.
• Oversaw evaluation of agent-based tasks in the logistics/custody and healthcare domains.