Computer Use AI Evaluator
This project centered on evaluating large language model outputs with a strong emphasis on accuracy, instruction-following, and safety/compliance. The work involved applying structured rubrics to assess responses systematically, verifying that they met defined standards and identifying areas for improvement. The project's scale demanded sustained engagement with complex instruction-following notebooks, where evaluators provided clear, actionable feedback to refine output quality. Specific evaluation tasks included resolving ambiguous edge cases, aligning on scoring standards, and documenting recurring failure patterns that could undermine reliability.
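
As a rough illustration of what rubric-based scoring can look like, here is a minimal Python sketch. The dimension names mirror the focus areas above, but everything else (the Rubric class, the 0-4 scale, the weights) is a hypothetical assumption, not the project's actual tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # Hypothetical weights per dimension; dimension names come from the
    # project's stated focus areas, the weights and scale are assumptions.
    weights: dict[str, float] = field(default_factory=lambda: {
        "accuracy": 0.4,
        "instruction_following": 0.4,
        "safety_compliance": 0.2,
    })
    scale_max: int = 4  # assumed 0-4 rating scale per dimension

    def score(self, ratings: dict[str, int]) -> float:
        """Weighted average of per-dimension ratings, normalized to [0, 1]."""
        total = 0.0
        for dim, weight in self.weights.items():
            rating = ratings[dim]
            if not 0 <= rating <= self.scale_max:
                raise ValueError(f"{dim} rating {rating} outside 0-{self.scale_max}")
            total += weight * (rating / self.scale_max)
        return total

# Example: a response that follows instructions and is safe,
# but contains a minor factual slip.
rubric = Rubric()
print(rubric.score({
    "accuracy": 3,
    "instruction_following": 4,
    "safety_compliance": 4,
}))  # -> 0.9
```

A fixed weighting like this is one simple way to make per-dimension judgments comparable across evaluators; in practice, aligning on what each rating level means per dimension matters as much as the arithmetic.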