Response Evaluation
I would be given a prompt and would need to identify its language and its type. I would then evaluate two responses from different models on instruction following, format, accuracy, helpfulness, and overall quality.
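
A minimal sketch of how one such evaluation could be recorded. The text does not specify a rating scale, field names, or how a winner is picked, so the 1 to 5 scale, the class names `ResponseRating` and `PromptEvaluation`, and the `preferred` tie-break rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# The five criteria named in the task description.
CRITERIA = (
    "instruction_following",
    "format",
    "accuracy",
    "helpfulness",
    "overall_quality",
)


@dataclass
class ResponseRating:
    """Scores for one model's response (hypothetical structure)."""
    model: str              # label for the model, e.g. "model_a" (assumed)
    scores: Dict[str, int]  # criterion -> score on an assumed 1-5 scale

    def __post_init__(self) -> None:
        # Require exactly the five criteria, each within the assumed scale.
        missing = [c for c in CRITERIA if c not in self.scores]
        extra = [c for c in self.scores if c not in CRITERIA]
        if missing or extra:
            raise ValueError(f"missing={missing}, unexpected={extra}")
        for criterion, score in self.scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{criterion} score {score} is outside 1-5")


@dataclass
class PromptEvaluation:
    """One prompt, its labels, and the ratings of the two responses."""
    prompt: str
    language: str     # identified prompt language, e.g. "en"
    prompt_type: str  # identified prompt type, e.g. "translation"
    rating_a: ResponseRating
    rating_b: ResponseRating

    def preferred(self) -> Optional[str]:
        # Assumed rule: compare overall quality only; None means a tie.
        a = self.rating_a.scores["overall_quality"]
        b = self.rating_b.scores["overall_quality"]
        if a == b:
            return None
        return self.rating_a.model if a > b else self.rating_b.model
```

Keeping the criteria in a single tuple means a rating that omits a criterion, or adds an unlisted one, is rejected when it is created rather than later during comparison.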