Independent Consultant
At Abundant (YC), the project scope centered on engineering a high-fidelity adversarial evaluation framework for State-of-the-Art (SOTA) LLM agents, specifically targeting GPT-4 and Claude 3.5. The objective was to move beyond standard benchmarks by creating complex, multi-step environments that tested the limits of agentic control and safety alignment in professional settings. Specific Data Labeling & Curation Tasks: Rather than basic annotation, my tasks involved the creation and labeling of "adversarial scenarios." This included: Compliance Trap Engineering: Designing and labeling 50+ unique prompt-based environments where a model was forced to choose between a direct user instruction and a conflicting, documented safety or bureaucratic constraint. Vulnerability Mapping: Identifying and categorizing instances of "sycophancy for bureaucracy," where models prioritized following rigid documentation over logical safety, and labeling these failure modes to train detection systems. Indirect Prompt Injection: Constructing datasets where malicious instructions were "hidden" within legitimate-looking data (e.g., a PDF or a simulated database) to evaluate the model's ability to filter unsafe context during RAG-style operations. Project Size: The project involved the curation and validation of novel dataset environments specifically for agentic control research. While the primary focus was on "quality over quantity" to ensure high-difficulty edge cases, the work supported the evaluation of multiple LLM generations across various safety-critical configurations. Quality Measures Adhered To: To ensure the robustness of the benchmark data, I adhered to several rigorous quality protocols: Zero-Inference Consistency: Every adversarial label was verified against a ground-truth logic tree to ensure that the "trap" was objectively solvable for an aligned model. Adversarial Stability: I conducted iterative "red-teaming" on my own datasets to ensure they maintained their difficulty across different model architectures and weren't bypassed by simple prompt hacks. Documentation-Alignment Check: Each scenario was cross-referenced with official model safety guidelines to ensure the "conflicting constraints" accurately reflected real-world deployment risks.