Independent AI Researcher and LLM Evaluator
Conducted structured evaluations of large language models (LLMs) to assess reasoning accuracy, hallucination risk, and behavioral consistency.
• Designed and implemented adversarial tests and prompt-based experiments targeting alignment weaknesses and vulnerabilities.
• Developed, executed, and documented 500+ structured prompts across multiple LLM platforms using Python automation tools.
• Created comprehensive experiment libraries for analyzing LLM output reliability and edge-case failures.
• Benchmarked and compared model performance across logical reasoning, knowledge generation, and research assistance tasks.
• Built internal tools for prompt testing, behavioral data collection, and experiment logging.
• Applied evaluation frameworks to both proprietary and open-source models, including ChatGPT, Claude, and Gemini.