Code Reviewer at Outlier
Evaluated a model's ability to perform code review on pull requests. Wrote prompts simulating realistic PR scenarios, then built rubrics to score whether the model correctly identified issues, proposed fixes that actually made sense, and generated a clean final patch. The focus was not just on whether the model found bugs, but on whether its feedback would be useful to a real developer reading it.