Training models to respond safely to unsafe prompts
Prompt (with picture) would be provided, two model outputs. Judging model responses on whether they respond appropriately to the harm level of the prompt. (e.g. very harmful prompts should be met with a refusal, prompts with sensitive subject matters should receive a disclaimer)