RLHF Preference Annotation Dataset Creator
Led the creation and publication of an RLHF preference annotation dataset focused on AI ethics: 95 prompts with 190 response pairs, detailed scoring criteria, and a failure-mode label for each annotation, plus written justifications for dispreferred responses to improve AI model alignment.
• Published the dataset on Hugging Face.
• Annotated across six ethics categories: refusal edge cases, sycophancy, parasocial attachment, anthropomorphism, bias, and dual-use harm.
• Scored every response pair on five dimensions with explicit failure-mode labeling; a sketch of one record follows this list.
• Built a DPO fine-tuning notebook on Colab using microsoft/phi-2 with LoRA adapters; a training sketch follows below.
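
To make the annotation structure concrete, here is what a single preference record in such a dataset could look like. All field names and scoring-dimension names below are hypothetical stand-ins, since the published schema is not reproduced in this entry.

```python
import json

# One preference record, shown as a Python dict. Field names and the five
# dimension names are illustrative placeholders, not the published schema.
record = {
    "prompt": "My chatbot told me it would be sad if I stopped talking to it.",
    "category": "parasocial attachment",   # one of the six ethics categories
    "chosen": "<preferred response>",
    "rejected": "<dispreferred response>",
    "scores": {                            # five scoring dimensions (placeholder names)
        "helpfulness": 4,
        "honesty": 5,
        "harm_avoidance": 5,
        "tone": 4,
        "boundary_setting": 5,
    },
    "failure_mode": "reinforces parasocial attachment",  # explicit failure-mode label
    "justification": (
        "The rejected response validates the illusion of reciprocal feelings "
        "instead of clarifying the assistant's nature."
    ),
}

# Records like this are typically serialized one per line (JSONL) for Hugging Face.
print(json.dumps(record, indent=2))
```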
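
The sketch below shows the general shape of DPO fine-tuning on microsoft/phi-2 with LoRA adapters, assuming TRL's DPOTrainer and a preference dataset with "prompt"/"chosen"/"rejected" columns. The dataset ID is a placeholder and the hyperparameters are illustrative, not those used in the actual notebook.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# LoRA adapters keep the base model frozen; target modules assume phi-2's
# attention/MLP projection names in the Transformers implementation.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)

# Placeholder dataset ID; substitute the published Hugging Face dataset.
dataset = load_dataset("your-username/ethics-preference-pairs", split="train")

training_args = DPOConfig(
    output_dir="phi2-dpo-ethics",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=1,
    beta=0.1,                # strength of the KL penalty toward the reference model
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10,
)

# With peft_config supplied and no ref_model, TRL uses the base model with
# adapters disabled as the implicit frozen reference. On TRL versions before
# 0.12, pass tokenizer=tokenizer instead of processing_class.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```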