Arsenic - Safety Re
In the Arsenic project, I evaluated AI-generated responses across complex conversational contexts to ensure they followed safety guidelines and policy standards. My tasks included scoring responses for safety, fluency, coherence, and policy alignment; identifying hallucinations and risky outputs; and flagging violations such as harmful instructions or sensitive-content breaches. I also compared pairs of model outputs (A/B testing) and selected the version that better aligned with the guidelines. In addition, I created and reviewed adversarial prompts and image-based scenarios designed to test whether the model would respond safely or could be triggered into guideline-breaking behavior. The role required nuanced judgment, consistency, and a strong understanding of risk categories, edge cases, and human-AI interaction.