Hermes
Project Hermes was a large-scale evaluation programme for Artificial Intelligence (AI), created to assess and enhance the performance of generative models across real-world business use cases. The project processed tens of thousands of input-output pairs each week, with global teams poring over large volumes of data daily.

Within that context, I assessed the quality of model responses against the HEMC guidelines, evaluating helpfulness, correctness, language quality, faithfulness to the source text, risk of hallucination, and correct handling of disclaimers and completions. Each unit called for structured reasoning, close reading of the raw text, and adherence to defined rubrics. Quality control was equally rigorous, combining gold-standard benchmarking, blind double reviews, inter-rater agreement monitoring, audit sampling, and enforced accuracy thresholds.
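As an illustration of the inter-rater agreement monitoring mentioned above, agreement between two blind reviewers can be tracked with a chance-corrected statistic such as Cohen's kappa. The Python sketch below is a minimal, hypothetical example; the pass/fail labels, reviewer data, and any flagging threshold are assumptions for illustration, not the actual Hermes tooling.

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        """Cohen's kappa for two raters who scored the same set of units."""
        assert len(ratings_a) == len(ratings_b)
        n = len(ratings_a)
        # Observed agreement: fraction of units where both reviewers gave the same label.
        observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Expected agreement by chance, from each reviewer's label frequencies.
        freq_a = Counter(ratings_a)
        freq_b = Counter(ratings_b)
        expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical blind double-review labels for six evaluation units.
    reviewer_1 = ["pass", "fail", "pass", "pass", "fail", "pass"]
    reviewer_2 = ["pass", "fail", "pass", "fail", "fail", "pass"]

    kappa = cohens_kappa(reviewer_1, reviewer_2)
    print(f"kappa = {kappa:.2f}")  # low values would prompt recalibration or audit sampling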