AI Safety Specialist – RLHF Response Evaluation & Guideline Design
Created and reviewed RLHF training tasks that test model compliance in situations involving conflicting instructions. Designed eight challenge categories targeting cognitive inertia, formatting violations, flawed premises, and guideline overrides. Examined and scored AI responses against stringent evaluation criteria.