Rate and review: Plinia
Probed LLM robustness by intentionally triggering errors with ambiguous visual inputs, such as culturally loaded imagery, to elicit incorrect or hallucinated model responses. Captured the raw LLM outputs and manually authored "golden responses" — accurate, unambiguous, culturally sensitive corrections — to serve as high-quality training targets.
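A pairing of raw model output and golden response like the one described above might be captured as a structured record and serialized for training. The sketch below is a minimal illustration under assumed conventions; the schema, field names, and `failure_type` labels are hypothetical, not taken from any specific pipeline.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ProbeRecord:
    """Hypothetical schema for one robustness-probe example."""
    image_id: str          # identifier of the ambiguous visual input
    prompt: str            # probing question shown with the image
    raw_output: str        # verbatim model response (may be hallucinated)
    golden_response: str   # accurate, culturally sensitive correction
    failure_type: str      # illustrative label, e.g. "cultural_misread"

def to_jsonl_line(record: ProbeRecord) -> str:
    """Serialize one record as a JSONL training example."""
    return json.dumps(asdict(record), ensure_ascii=False)

# Illustrative example: a culturally loaded image the model misreads.
record = ProbeRecord(
    image_id="img_0001",
    prompt="What festival is shown in this image?",
    raw_output="This is clearly a Halloween parade.",
    golden_response=(
        "The image shows a Día de los Muertos procession, a Mexican "
        "tradition honoring deceased relatives; it is distinct from Halloween."
    ),
    failure_type="cultural_misread",
)
line = to_jsonl_line(record)
```

Storing both the raw output and the correction in the same record keeps the failure evidence next to the training target, which makes later auditing of the golden responses straightforward.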