I am an AI data and evaluation specialist with hands-on experience across computer vision, NLP, and LLM-based systems, focusing on high-quality data annotation, QA-driven evaluation, and structured scenario design. My work spans large-scale dataset labeling (bounding boxes, polygons, keypoints), NLP annotation (sentiment analysis, entity recognition), and systematic evaluation of LLM outputs for correctness, safety, and instruction-following.
I have practical experience designing and reviewing evaluation workflows, defining gold-standard behaviors, documenting edge cases, and applying scoring logic to assess model and agent performance. This includes prompt evaluation, RLHF-style feedback, bias detection, and consistency checks to ensure reliable and reproducible AI training data.
On the technical side, I am proficient in Python and comfortable working with structured formats such as JSON and YAML for evaluation specifications and tooling. I have used ML and deployment platforms like Hugging Face, and annotation/evaluation platforms including Appen, Remotasks, Toloka, and Clickworker. I am also familiar with backend and tooling concepts such as API-based workflows, basic debugging, and automation support for AI evaluation pipelines.
My academic background in the physical and natural sciences and in mechatronics provides strong analytical thinking, experimental rigor, and problem-solving skills, which I apply to AI evaluation, data quality assurance, and agent testing tasks. I am detail-oriented, guideline-driven, and comfortable collaborating with distributed teams to refine datasets, evaluation frameworks, and testing methodologies.
Overall, I bring a balanced combination of technical understanding, QA mindset, and applied AI experience, enabling me to contribute effectively to AI training, evaluation, annotation, and software-adjacent roles across diverse projects.
Languages: Polish, English, Russian, German, Arabic
Labeling Experience
CI/CD-Style Evaluation Pipelines for AI and LLM Systems
I worked on designing and maintaining CI/CD-style evaluation pipelines for machine learning and LLM-based systems, where evaluations are automatically triggered on model, prompt, or code changes. My responsibilities included structuring versioned evaluation suites, integrating regression and benchmark tests, and defining pass/fail gates based on predefined metrics and thresholds.
I helped organize evaluation artifacts (scenarios, scoring rules, baselines) in a reproducible manner, and supported automated execution using Dockerized environments to ensure consistent dependency and runtime behavior. Evaluation results were reviewed to detect regressions, performance drift, or safety issues before changes were promoted.
This workflow enabled continuous validation of model behavior, faster feedback cycles, and higher confidence in iterative model improvements.
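As an illustration of the gating logic described above, the sketch below shows a minimal metric-threshold gate in Python. The metric names, thresholds, and results-file layout are illustrative assumptions, not taken from any one project; in practice the thresholds and metrics were defined per evaluation suite.

```python
# Minimal sketch of a metric-threshold gate for a CI-style eval step.
# Metric names, thresholds, and the results-file format are illustrative.
import json
import sys

THRESHOLDS = {
    "accuracy": 0.90,               # minimum acceptable score
    "safety_pass_rate": 0.99,
    "instruction_following": 0.85,
}

def gate(results_path: str) -> int:
    with open(results_path) as f:
        metrics = json.load(f)      # e.g. {"accuracy": 0.93, ...}

    failures = [
        f"{name}: {metrics.get(name, 0.0):.3f} < {minimum:.3f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0     # nonzero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```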
2025
Regression Testing & Performance Drift Detection for AI and LLM Systems
I conducted regression testing for ML and LLM-based systems to ensure new model versions or prompt updates did not degrade previously correct behaviors. My work involved defining baseline (golden) test sets, running repeated evaluations across model iterations, and comparing results to detect regressions in correctness, safety, and instruction-following.
I tracked changes in key metrics (pass/fail rates, weighted scores, error categories), identified regression patterns, and documented root causes such as prompt changes, tool-use failures, or logic drift. Findings were used to gate releases, refine evaluation coverage, and prioritize fixes for high-impact regressions.
This process emphasized reproducibility, consistent scoring logic, and clear reporting to support reliable model iteration and continuous improvement.
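The core comparison behind this kind of regression check can be sketched in a few lines of Python. The case IDs and the boolean pass/fail result format below are hypothetical simplifications; real suites also tracked weighted scores and error categories.

```python
# Sketch of a baseline-vs-candidate comparison over a golden test set.
# Case IDs and the boolean pass/fail format are hypothetical.
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed on the baseline but fail on the candidate."""
    return [case_id for case_id, passed in baseline.items()
            if passed and not candidate.get(case_id, False)]

baseline = {"case-001": True, "case-002": True, "case-003": False}
candidate = {"case-001": True, "case-002": False, "case-003": True}

regressions = find_regressions(baseline, candidate)
print(f"{len(regressions)} regression(s): {regressions}")  # -> ['case-002']
```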
2025
Retrieval-Augmented Generation (RAG) Evaluation & QA for LLM Systems
I worked on evaluation and QA workflows for Retrieval-Augmented Generation (RAG) systems, focusing on assessing the full pipeline from document retrieval to final answer generation. My responsibilities included evaluating retrieval quality (relevance, coverage, ranking) and generation quality (correctness, faithfulness to sources, completeness).
I reviewed whether generated answers were grounded in retrieved documents, identified hallucinations or unsupported claims, and flagged missing or weak evidence. I also evaluated citation alignment, answer usefulness, and failure modes such as partial retrieval, irrelevant context injection, or over-reliance on incorrect sources.
Structured scoring and clear justification were applied to provide reliable feedback for model tuning, benchmark evaluation, and continuous improvement of RAG pipelines.
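To make the groundedness judgment concrete, here is a deliberately naive lexical check in Python. It is a stand-in for the human "is this claim supported by the retrieved documents?" review described above, not the method actually used; the stopword list and example documents are illustrative.

```python
# Naive lexical groundedness heuristic, a simplified stand-in for the
# rubric-based human grounding review described above.
import re

def support_score(sentence: str, documents: list[str]) -> float:
    """Fraction of a sentence's content words found in any retrieved document."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    words -= {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
    if not words:
        return 1.0
    corpus = " ".join(documents).lower()
    return sum(w in corpus for w in words) / len(words)

docs = ["The Nile is the longest river in Africa."]
print(support_score("The Nile is the longest river in Africa.", docs))  # 1.0, grounded
print(support_score("The Nile freezes every winter.", docs))            # 0.25, flag it
```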
2024
Search Relevance & Ranking Evaluation for AI and LLM Systems
I worked on search relevance and ranking evaluation tasks for AI-powered search and retrieval systems. My responsibilities included assessing how well retrieved documents, passages, or generated answers matched user queries, and assigning graded relevance scores based on predefined guidelines.
I evaluated outputs across dimensions such as topical relevance, completeness, factual correctness, and usefulness, and identified failure cases including partial relevance, off-topic results, and hallucinated content. For ranked results, I reviewed ordering quality and documented ranking errors where more relevant items were placed lower.
This work supported improving retrieval quality for LLM-powered search and RAG systems, providing reliable human feedback signals for model tuning and evaluation.
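One standard way to quantify the ordering errors described above is NDCG over graded relevance labels. The sketch below uses a linear-gain DCG variant on a 0-3 relevance scale; the label values are made up for illustration.

```python
# Sketch of NDCG over graded relevance labels (0-3 scale), one common way
# to quantify ranking errors. Labels below are illustrative.
import math

def dcg(relevances: list[int]) -> float:
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances: list[int]) -> float:
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels for a ranked result list: a highly relevant item sits at rank 3.
labels = [1, 0, 3, 2]
print(f"NDCG = {ndcg(labels):.3f}")  # < 1.0 signals a ranking error
```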
2024
Audio, Speech & ASR Data Labeling and Quality Evaluation for AI Systems
I worked on audio and speech data labeling projects supporting automatic speech recognition (ASR) and conversational AI systems. My responsibilities included transcribing spoken audio into accurate text, reviewing ASR outputs, and correcting errors related to pronunciation, accents, background noise, and overlapping speech.
I also evaluated audio and transcription quality by identifying misrecognitions, segmentation issues, and timing inconsistencies, and applied structured ratings to assess transcription accuracy and usability. Tasks required strict adherence to transcription guidelines, handling diverse audio conditions, and maintaining consistency across large datasets.
In addition, I labeled and reviewed emotional tone and speaker intent in speech samples to support downstream tasks such as emotion recognition and dialog quality evaluation.
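The transcription accuracy ratings mentioned above typically rest on word error rate (WER). Below is a minimal self-contained Python sketch of WER via word-level edit distance; the example sentences are illustrative.

```python
# Sketch of word error rate (WER), the standard metric behind transcription
# accuracy scoring. Example sentences are illustrative.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn the light on"))  # 0.25 (one substitution)
```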
2024
Education
Helwan University
Bachelor of Science, Physical and Natural Sciences (Microbiology and Biochemistry)
2023 - 2025