AI platform builder, LLM evaluator
As a personal portfolio project, I built and evaluated an AI legal tech platform prototype, focusing on output quality and safety. I evaluated LLM-generated legal text using an LLM-as-a-judge architecture, specifically the GEval metric from the Python DeepEval library, working in the Windsurf IDE. I applied strict rubric criteria to assess hallucinations, bias, and security risks in model outputs.
• Evaluated legal tech LLM outputs for grounding, accuracy, bias, and safety.
• Tuned rubrics and grading criteria using DeepEval's evaluation tooling.
• Leveraged the Python DeepEval library with GEval metrics for systematic review (sketched below).
• Applied an LLM-as-a-judge architecture in the legal domain.
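A minimal sketch of this kind of rubric-based GEval evaluation, using DeepEval's standard GEval and LLMTestCase API. The rubric wording, example question, retrieval context, and threshold here are illustrative placeholders, not the project's actual criteria.

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a strict rubric as a GEval metric; a judge LLM scores each output
# against these criteria and returns a score in [0, 1] plus a reason.
# The criteria text below is a hypothetical example, not the project's rubric.
grounding_rubric = GEval(
    name="Legal Grounding",
    criteria=(
        "Check that the actual output makes no legal claims unsupported by "
        "the retrieval context, cites no nonexistent statutes or cases, and "
        "flags uncertainty instead of guessing."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,  # fail a test case below this judge score (illustrative)
)

# A single evaluation unit: the user query, the model's answer, and the
# source material the answer should be grounded in (all placeholders).
test_case = LLMTestCase(
    input="Can a landlord raise rent mid-lease in this jurisdiction?",
    actual_output=(
        "No; the lease fixes rent for its term unless it contains an "
        "escalation clause."
    ),
    retrieval_context=["Excerpt from the applicable tenancy statute..."],
)

# Run the judge over the test case and report per-metric scores and reasons.
evaluate(test_cases=[test_case], metrics=[grounding_rubric])
```

Separate GEval instances with their own criteria and thresholds can be passed in the same `metrics` list to cover bias and security checks alongside grounding.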