Adversarial Data Labeling & Red-Teaming for LLM Safety
Led red-teaming initiatives to identify vulnerabilities in language models through adversarial evaluation. Designed and deployed custom adversarial test sets targeting hallucination and fairness weaknesses; findings directly improved model safety protocols and guided mitigation development.
• Conducted structured evaluations of unsafe and biased LLM outputs.
• Implemented mitigation experiments that improved fairness metrics.
• Integrated adversarial findings into retraining and feedback loops.
• Reported results to research and product teams.
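
A minimal sketch of how such an adversarial evaluation harness could be structured, assuming a JSON test set of prompts tagged by category; query_model, is_flagged, and the file format are hypothetical placeholders for illustration, not the project's actual tooling.

    # Sketch of an adversarial evaluation loop over a categorized test set.
    # query_model / is_flagged are placeholders (assumptions), not real project code.
    import json
    from collections import defaultdict

    def query_model(prompt: str) -> str:
        # Placeholder: replace with a call to the model under test.
        return ""

    def is_flagged(output: str, category: str) -> bool:
        # Placeholder check: a real run would use rubric-based or classifier scoring.
        return False

    def run_adversarial_suite(path: str) -> dict:
        # Each test case is assumed to look like:
        # {"prompt": "...", "category": "hallucination" or "fairness"}
        with open(path) as f:
            cases = json.load(f)
        totals, failures = defaultdict(int), defaultdict(int)
        for case in cases:
            totals[case["category"]] += 1
            output = query_model(case["prompt"])
            if is_flagged(output, case["category"]):
                failures[case["category"]] += 1
        # Per-category failure rates are the kind of signal fed back into
        # retraining and mitigation experiments.
        return {cat: failures[cat] / totals[cat] for cat in totals}

Per-category failure rates like these are one common way to track whether mitigation experiments actually move hallucination and fairness metrics between evaluation rounds.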