Annotator
The project involves acting as an expert judge who evaluates AI-generated responses to complex multimodal requests, comparing the outputs of two models (Media A and Media B) to determine which better follows instructions and produces higher-quality output.

The core evaluation is built on four distinct pillars:

- Instruction Following: strict adherence to every part of the prompt, with no partial credit given.
- Visual Quality: the aesthetics of the result and the preservation of the original image's structure.
- AI-Generated Issues: identification of "Red Flags" such as distorted features, melting shapes, or impossible geometry.
- Overall Preference: a weighing of the factors above to determine the most "trustworthy" result.

A critical priority is to rank Instruction Following and Realism/Naturalness above general aesthetic quality, so that a "pretty" but incorrect image is never preferred over a more accurate one. Evaluators must treat the full Conversation History as the sole source of truth, assess each category independently, and avoid "Ties" unless no meaningful difference exists.
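To make the priority ordering concrete, here is a minimal sketch of how the rubric could be encoded. All names here (`Verdict`, `Comparison`, `overall_preference`) are hypothetical, invented for this illustration, and the strict lexicographic ordering is one possible reading of how the pillars are weighed, not the project's actual procedure.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Outcome of comparing Media A against Media B on one category."""
    A_BETTER = "A"
    B_BETTER = "B"
    TIE = "Tie"  # reserved for cases with no meaningful difference


@dataclass
class Comparison:
    """Independent verdicts for three of the four pillars; Overall
    Preference is derived from them rather than judged directly."""
    instruction_following: Verdict
    visual_quality: Verdict
    ai_generated_issues: Verdict  # fewer "Red Flags" wins this category


def overall_preference(c: Comparison) -> Verdict:
    """Derive Overall Preference, ranking Instruction Following and
    realism (the absence of AI-generated issues) above general
    aesthetics, so a "pretty" but incorrect image never wins."""
    # Assumed lexicographic priority: the first decisive category settles it.
    for verdict in (c.instruction_following,
                    c.ai_generated_issues,
                    c.visual_quality):
        if verdict is not Verdict.TIE:
            return verdict
    return Verdict.TIE


# Example: instructions are a tie, A is more realistic, B is prettier.
# Realism outranks aesthetics, so A wins overall.
example = Comparison(
    instruction_following=Verdict.TIE,
    visual_quality=Verdict.B_BETTER,
    ai_generated_issues=Verdict.A_BETTER,
)
assert overall_preference(example) is Verdict.A_BETTER
```

In practice a human judge weighs the categories more holistically than a strict priority order; the sketch simply shows the core constraint that aesthetic quality alone should never override a decisive difference in instruction following or realism.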