Multimodal evaluation
I reviewed paired image and text prompts to check whether each description matched the visual content and whether the model’s reasoning followed from what the image showed. The tasks involved identifying mismatches, labeling incorrect interpretations and scoring the quality of short reasoning steps. I followed detailed instructions, applied them consistently and documented cases where an image supported more than one reasonable reading. The work required careful visual attention and steady judgment across many items.