Red Team Attack
In this project, I completed hundreds of tasks aimed at improving model outputs with respect to harmlessness. The work involved significant amounts of harmful and adversarial content that had to be analyzed according to detailed guidelines. Beyond providing human feedback, labeling, and training data to discourage unsafe outputs and establish gold-standard safe responses or capability refusals, I used prompt engineering to craft red team attacks that deliberately attempted to elicit harmful responses from the model. When the model did produce a harmful response, I applied preference ranking and in-depth written feedback, and contributed to an expansive rubric and guidelines identifying the outputs and content the model should neither produce nor engage with.
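To make the workflow concrete, below is a minimal sketch of how a single preference-ranking record from this kind of red-team evaluation might be structured. The field names, severity scale, and rubric tags are hypothetical illustrations for this sketch, not the actual schema or guidelines used in the project.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record structure for one red-team evaluation task.
# Field names and the severity scale are illustrative assumptions,
# not the actual schema used in the project.
@dataclass
class RedTeamRecord:
    attack_prompt: str    # the adversarial prompt submitted to the model
    responses: List[str]  # candidate model responses to the prompt
    ranking: List[int]    # indices of responses, best (safest) first
    harm_severity: int    # annotator-assigned severity, e.g. 0 (safe) to 4 (severe)
    rubric_tags: List[str] = field(default_factory=list)  # guideline categories violated
    feedback: str = ""    # free-text human feedback on the failure

# Example: the annotator ranks the safe refusal above the harmful completion.
record = RedTeamRecord(
    attack_prompt="[redacted adversarial prompt]",
    responses=["[harmful completion]", "[safe refusal]"],
    ranking=[1, 0],  # response 1 (the refusal) ranked above response 0
    harm_severity=3,
    rubric_tags=["dangerous-information"],
    feedback="Model complied after a role-play framing; it should have refused.",
)
print(record.ranking)  # [1, 0]
```

A structure like this keeps the ranking, severity judgment, and written feedback tied to the specific attack that produced the failure, so each record can feed both preference training and rubric development.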