Iteratively editing a user request for code generation to improve it, with final response ranking, rubrics, and unit tests
This project evaluates agent-completion conversations in which models use structured tool APIs to fulfill user requests. The evaluation checks three things: that the model's work is correct, that it uses the tools properly, and that the final output matches what the user wanted.
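As a rough illustration of what such a check might look like, the sketch below scores one conversation on tool use and on the final output. Everything in it is an assumption made for illustration and not part of the project: the `Conversation` and `ToolCall` structures, the `TOOL_SCHEMA` table with its `read_file` and `run_tests` tools, and the phrase-matching rule that stands in for a real rubric.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Conversation:
    user_request: str
    tool_calls: list
    final_output: str


# Hypothetical tool schema: tool name -> names of required arguments.
TOOL_SCHEMA = {
    "read_file": {"path"},
    "run_tests": {"target"},
}


def check_tool_use(conv):
    """Return a list of problems found in the conversation's tool calls."""
    problems = []
    for call in conv.tool_calls:
        required = TOOL_SCHEMA.get(call.name)
        if required is None:
            problems.append(f"unknown tool: {call.name}")
            continue
        missing = required - set(call.arguments)
        if missing:
            problems.append(f"{call.name} missing arguments: {sorted(missing)}")
    return problems


def score(conv, required_phrases):
    """Score one conversation against a toy rubric.

    Tool use passes when every call names a known tool and supplies its
    required arguments. The output passes when it contains every required
    phrase (case-insensitive), a crude stand-in for a real rubric item.
    """
    tool_problems = check_tool_use(conv)
    output_lower = conv.final_output.lower()
    missing_phrases = [p for p in required_phrases if p.lower() not in output_lower]
    return {
        "tool_use_ok": not tool_problems,
        "tool_problems": tool_problems,
        "output_ok": not missing_phrases,
        "missing_phrases": missing_phrases,
    }
```

Calling `score(conv, ["all tests passed"])` on a conversation returns a dictionary with one pass/fail flag per criterion plus the reasons for any failure. Checking whether the model's work is actually correct would normally rely on the unit tests themselves, which this sketch does not attempt.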