LLM Output Evaluation – Spanish
This project involved evaluating the outputs of a multilingual large language model (LLM) in Spanish. Tasks included rating prompt–response pairs for factual correctness, fluency, and relevance; classifying model behavior across question-answering tasks; suggesting prompt improvements; and writing alternative completions. The dataset comprised over 10,000 samples spanning domains such as customer service, general knowledge, and conversational AI. Strict quality control measures were applied, including inter-annotator agreement checks and gold-standard reviews.
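As an illustration of the inter-annotator agreement checks mentioned above, below is a minimal sketch that computes Cohen's kappa between two annotators' labels. The label set and the sample ratings are hypothetical, and this is one common way such checks are implemented rather than a description of the project's actual tooling.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the chance agreement implied by each
    annotator's marginal label distribution.
    """
    assert len(ratings_a) == len(ratings_b), "paired ratings required"
    n = len(ratings_a)

    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over labels of the product of marginal probabilities.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in freq_a.keys() | freq_b.keys()
    )

    # Degenerate case: chance agreement of 1 means the statistic is undefined;
    # treat identical constant labelings as perfect agreement.
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two annotators rating five model responses
# on a three-point scale.
annotator_1 = ["good", "good", "bad", "neutral", "good"]
annotator_2 = ["good", "bad", "bad", "neutral", "good"]
print(f"Cohen's kappa: {cohen_kappa(annotator_1, annotator_2):.3f}")  # 0.688
```

A kappa near 1 indicates annotators agree well beyond chance; values below a project-defined threshold typically trigger guideline revisions or re-annotation.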