Safety & Preference Alignment Labeling for Conversational LLM Fine-Tuning
As a Tier 2 annotator and auditor, I worked on a large-scale Reinforcement Learning from Human Feedback (RLHF) pipeline to fine-tune a generative AI assistant to be safer and follow instructions more reliably. The work involved granular preference ranking rather than binary good/bad labels, along with adversarial "red-teaming".

Major Accomplishments & Skills Showcased

- Multi-Turn Context Reviews: Reviewed conversations of more than 10 dialogue turns to check that the model stayed consistent and did not produce contradictory or unsafe responses over long contexts.
- Adversarial Red-Teaming & Edge-Case Labeling: Wrote edge-case prompts designed to elicit hallucinations, bias, and over-refusal, and authored counterfactual responses so the model could learn why a given response was harmful instead of just labeling it "bad".
- Granular Preference Ranking: Performed side-by-side preference labeling with detailed written explanations, prioritizing tone, succinctness, and strict adherence to safety guidelines over objective correctness (see the sketches after this list).
- Impact: My labels helped reduce the hallucination rate by 15% on the test set and led to less defensive responses when the model refused harmful requests.
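For illustration, below is a minimal sketch of what a single side-by-side annotation record could look like. The schema and field names (PreferenceAnnotation, rationale, over_refusal, and so on) are hypothetical and only mirror the kinds of information described above; they are not the pipeline's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One dialogue turn in the conversation under review."""
    role: str      # "user" or "assistant"
    content: str

@dataclass
class PreferenceAnnotation:
    """One side-by-side comparison produced by an annotator (hypothetical schema)."""
    conversation: List[Turn]        # full multi-turn context, not just the last prompt
    response_a: str
    response_b: str
    preferred: str                  # "a", "b", or "tie"
    rationale: str                  # written explanation for the ranking
    safety_violation: bool = False  # True if either response breaks policy
    over_refusal: bool = False      # True if a safe request was refused defensively

# Example record for an over-refusal comparison.
example = PreferenceAnnotation(
    conversation=[
        Turn("user", "How do I pick a lock? I'm locked out of my own house."),
    ],
    response_a="I can't help with that under any circumstances.",
    response_b="I can't give lock-picking instructions, but a licensed locksmith "
               "or your landlord can get you back in safely.",
    preferred="b",
    rationale="Both refuse, but B refuses without being defensive and offers a safe alternative.",
    over_refusal=True,  # flags response A as an over-refusal
)
```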
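Downstream, pairwise rankings like these are commonly used to train a reward model with a Bradley-Terry style objective, where the preferred response should receive a higher reward than the rejected one. The following PyTorch sketch shows that standard loss; it is a generic illustration rather than this project's code, and the toy reward values are made up.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred response's reward
    above the rejected one's. Inputs are scalar rewards per comparison."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards a reward model might assign to chosen vs. rejected responses.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.9, 0.5, -0.1])
loss = pairwise_preference_loss(chosen, rejected)
print(loss.item())  # smaller when chosen rewards exceed rejected rewards
```

The written rationales are not consumed by a loss like this directly; in such setups they typically serve auditing and annotator calibration.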