AI Benchmark Design and Model Evaluation - Data Labeler
I contributed to the creation of multimodal and text-only AI benchmarks for model evaluation at Handshake AI. My responsibilities included designing, labeling, and analyzing difficult reasoning tasks and reviewing AI outputs for accuracy. I documented failure cases to support robust benchmarking and continuous improvement.
• Designed and labeled multimodal and text-based reasoning tasks.
• Evaluated model responses for completeness and validity.
• Documented step-by-step solutions and outcome justifications.
• Identified gaps in model understanding and reported issues.