Founder & Lead Engineer — Data Labeling, QA & LLM Output Evaluation, MetaOrcha
Led systematic AI failure-mode analysis and behavioral validation of agent outputs on a production-grade LLM orchestration platform. Developed and enforced labeling guidelines, correctness criteria, and ambiguity escalation paths for output review. Benchmarked functional equivalence across LLM providers and stress-tested third-party agent network outputs for quality.
• Defined annotation-style acceptance criteria for agentic outputs.
• Crafted adversarial test suites targeting model behaviors and failure cases.
• Built processes for edge-case enumeration, inter-annotator agreement, and output ambiguity handling (a minimal agreement-metric sketch follows this list).
• Used internal/proprietary tooling with evaluation rubrics for ongoing quality assurance.
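To make the inter-annotator agreement work concrete, here is a minimal, self-contained Python sketch computing Cohen's kappa over two reviewers' verdicts on the same agent outputs. The reviewer data, the pass/fail/ambiguous label set, and the `cohen_kappa` helper are illustrative assumptions for this sketch, not MetaOrcha's internal tooling.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired, non-empty labels"
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each reviewer's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both reviewers used a single identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts from two reviewers on the same six agent outputs.
reviewer_1 = ["pass", "pass", "fail", "ambiguous", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "ambiguous", "pass", "pass"]
print(f"kappa = {cohen_kappa(reviewer_1, reviewer_2):.2f}")  # kappa = 0.45
```

Tracking kappa per annotation batch gives a quantitative trigger for the escalation paths mentioned above: when agreement drops below a chosen threshold, the guidelines or the ambiguous items get re-reviewed.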