Cypher Evals
I worked on the Cypher Evals project at Outlier, where I evaluated and stress-tested advanced language-model behavior across a wide range of tasks. I assessed model outputs for correctness, reasoning quality, creativity, safety, and adherence to instructions. I also performed detailed annotations, identified reasoning flaws, and graded multi-step solutions to help improve the model's reliability.