Apron evals
In this project, I was given a prompt asking the model to fix, modify, rewrite, explain, or compile a specific piece of code, or to write code from scratch. I was then given two model responses and evaluated each one in several categories: instruction following, truthfulness, clarity, harmlessness, and length. Finally, I completed a side-by-side comparison of the two responses and chose the better one.