LLM Output Evaluation & Prompt Testing
Evaluated LLM-generated outputs for logical consistency, factual accuracy, and hallucinations on platforms including Amazon Bedrock and with models such as Claude. Tested and graded prompt-response pairs in business, workflow, logistics, and healthcare contexts, and reviewed AI-generated content for general digital-assistant use cases. Identified errors, inconsistencies, and edge cases, and performed quality checks in both Dutch and English. Applied structured evaluation criteria to assess output quality systematically.
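The structured, criteria-based grading described above can be sketched as a small weighted rubric. The criteria names, weights, and 1-5 scale below are illustrative assumptions, not the actual rubric used in this work:

```python
# Minimal sketch of a structured rubric for grading LLM
# prompt-response pairs. Criteria and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class Score:
    criterion: str
    value: int      # 1 (fails criterion) to 5 (fully meets it)
    note: str = ""  # optional reviewer remark

def grade(scores: list[Score], weights: dict[str, float]) -> float:
    """Weighted average of criterion scores on the 1-5 scale."""
    total = sum(weights[s.criterion] * s.value for s in scores)
    norm = sum(weights[s.criterion] for s in scores)
    return round(total / norm, 2)

# Example: grading one response against three criteria.
weights = {"consistency": 0.4, "accuracy": 0.4, "hallucination": 0.2}
scores = [
    Score("consistency", 5),
    Score("accuracy", 4, "one unverified date"),
    Score("hallucination", 5),
]
print(grade(scores, weights))  # -> 4.6
```

Keeping per-criterion notes alongside numeric scores mirrors how evaluators document specific errors and edge cases rather than reporting only an aggregate number.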