Technical RLHF and Red Teaming
Scope:
Led adversarial "Red Team" evaluation and RLHF alignment for a large language model (LLM) focused on technical and security domains. The project involved identifying logical vulnerabilities and prompt-injection risks, and ensuring technical accuracy in generated code.

Specific Tasks:
- Adversarial Prompting: Developed complex, multi-step prompts to stress-test model safety filters and ethical guardrails.
- RLHF Ranking: Evaluated and ranked model outputs against a strict rubric of logical soundness, code efficiency, and security best practices.
- Code Debugging: Identified and labeled security flaws (e.g., OWASP Top 10 categories) in model-generated Python and Bash scripts.
- NER Classification: Manually tagged sensitive entities to prevent leakage of PII or proprietary technical data.

Project Size:
- Evaluated 2,500+ complex technical prompts.
- Audited 1,000+ code blocks for functional and security compliance.
- Contributed to a high-priority model deployment for enterprise-grade security applications.

Quality Measures:
- Multi-Pass Review: Every high-risk label underwent a self-audit and secondary verification against the project-specific security rubric.
- 0.98 Inter-Annotator Agreement (IAA): Consistently maintained top-tier alignment with expert consensus on complex ethical and technical edge cases.
- SOP Adherence: Strictly followed the AWS Well-Architected Framework's Security Pillar guidelines when handling and processing all training datasets.
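To illustrate the kind of code-debugging check described above, here is a minimal sketch of a static scan for injection-prone calls in model-generated Python. The RISKY_CALLS list and the function names are hypothetical examples, not the project's actual rubric:

```python
import ast

# Hypothetical subset of call names flagged during code audits
RISKY_CALLS = {"eval", "exec", "os.system"}


def _dotted_name(node):
    """Rebuild a dotted call name (e.g. 'os.system') from an AST node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = _dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def flag_risky_calls(source: str) -> list[str]:
    """Return the risky call names found in a model-generated snippet."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = _dotted_name(node.func)
        if name in RISKY_CALLS:
            findings.append(name)
        # subprocess.* with shell=True is a classic injection risk (OWASP A03)
        elif name and name.startswith("subprocess.") and any(
            kw.arg == "shell"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        ):
            findings.append(f"{name}(shell=True)")
    return findings
```

A scan like this only surfaces candidates for review; each finding was still judged by a human annotator against the security rubric.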
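The IAA figure cited above can be measured with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, assuming two annotators assigning categorical labels to the same items (the label values here are illustrative):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For multi-class rubrics with many annotators, a generalization such as Fleiss' kappa or Krippendorff's alpha would be the usual choice instead.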