Cross-Lingual Semantic Similarity & Prompt Evaluation for LLMs
Contributed to multiple AI training projects focused on evaluating and annotating natural-language data for large language models. On the Cross-Lingual Semantic Textual Similarity with Register and Politeness (XSTS+R+P) project, assessed Arabic–English sentence pairs for semantic equivalence, tone, and stylistic accuracy; tasks included applying 1–5 rating scales with written justifications, flagging semantic mismatches, and ensuring register consistency. Also evaluated LLM outputs for truthfulness, clarity, neutrality, and relevance across diverse domains (e.g., marketing, education, general knowledge). This work included writing and refining prompts based on user intent (e.g., open QA, closed QA, generation, brainstorming) and rating model outputs against strict quality guidelines. All projects followed rigorous accuracy and consistency metrics, with multi-rater agreement protocols and daily feedback loops.