Prompt Engineer / AI Labeller
- Designed YAML-based benchmarks to evaluate LLM reasoning under implicit constraints, conflicting user intents, and real-world workflow dependencies (sketched below).
- Built simulated environments with explicit world state, action schemas, and execution rules to test decision sequencing and outcome correctness objectively.
- Developed and audited pass/fail evaluation rubrics to eliminate hidden failure modes and differentiate model performance reliably, without ambiguous grading criteria.
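
A minimal sketch of the shape such a YAML task spec might take; every field name here (world_state, actions, execution_rules, rubric) is an illustrative assumption, not the actual schema used:

```yaml
# Hypothetical benchmark task spec: an implicit constraint (out-of-office day)
# that the model must infer before acting on the world state.
task_id: book-meeting-001
description: >
  Schedule a meeting for the user while respecting an implicit
  constraint: the user is out of office on Friday.

world_state:
  calendar:
    - {day: thursday, slot: "14:00", free: true}
    - {day: friday,   slot: "10:00", free: true}
  user_preferences:
    out_of_office: [friday]      # implicit constraint, never stated in the prompt

actions:                         # action schema: name, arguments, precondition, effect
  - name: book_slot
    args: [day, slot]
    precondition: "slot is free AND day not in out_of_office"
    effect: "slot state changes from free to booked"

execution_rules:
  - "actions are applied in order; a violated precondition aborts the run"
  - "the final world state is compared against the rubric below"

rubric:                          # binary pass/fail, no partial credit
  pass_if:
    - "exactly one slot is booked"
    - "the booked day is not in out_of_office"
  fail_if:
    - "any action violated its precondition"
```

Grading against explicit world state and preconditions, rather than free-form judgment, is what makes the pass/fail outcome objective and reproducible across graders.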