Evaluating Model Responses
Rated the responses of 2-4 models to a given prompt along specified axes, generally truthfulness, verbosity, completeness, and instruction following. Created rubrics for prompts, with criteria phrased as questions beginning "Does the response...?".