I am an AI data and evaluation specialist with hands-on experience across computer vision, NLP, and LLM-based systems, focusing on high-quality data annotation, QA-driven evaluation, and structured scenario design. My work spans large-scale dataset labeling (bounding boxes, polygons, keypoints), NLP annotation (sentiment analysis, entity recognition), and systematic evaluation of LLM outputs for correctness, safety, and instruction-following.
I have practical experience designing and reviewing evaluation workflows, defining gold-standard behaviors, documenting edge cases, and applying scoring logic to assess model and agent performance. This includes prompt evaluation, RLHF-style feedback, bias detection, and consistency checks to ensure reliable and reproducible AI training data.
On the technical side, I am proficient in Python and comfortable working with structured formats such as JSON and YAML for evaluation specifications and tooling. I have used ML and deployment platforms like Hugging Face, and annotation/evaluation platforms including Appen, Remotasks, Toloka, and Clickworker. I am also familiar with backend and tooling concepts such as API-based workflows, basic debugging, and automation support for AI evaluation pipelines.
My academic background in the physical and natural sciences and in mechatronics provides strong analytical thinking, experimental rigor, and problem-solving skills, which I apply to AI evaluation, data quality assurance, and agent testing tasks. I am detail-oriented, guideline-driven, and comfortable collaborating with distributed teams to refine datasets, evaluation frameworks, and testing methodologies.
Overall, I bring a balanced combination of technical understanding, QA mindset, and applied AI experience, enabling me to contribute effectively to AI training, evaluation, annotation, and software-adjacent roles across diverse projects.
Languages: Polish, English, Russian, German, Arabic
Labeling Experience
CI/CD-Style Evaluation Pipelines for AI and LLM Systems
I worked on designing and maintaining CI/CD-style evaluation pipelines for machine learning and LLM-based systems, where evaluations are automatically triggered on model, prompt, or code changes. My responsibilities included structuring versioned evaluation suites, integrating regression and benchmark tests, and defining pass/fail gates based on predefined metrics and thresholds.
I helped organize evaluation artifacts (scenarios, scoring rules, baselines) in a reproducible manner, and supported automated execution using Dockerized environments to ensure consistent dependency and runtime behavior. Evaluation results were reviewed to detect regressions, performance drift, or safety issues before changes were promoted.
This workflow enabled continuous validation of model behavior, faster feedback cycles, and higher confidence in iterative model improvements.
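As an illustration of the gating logic described above, the sketch below shows a minimal metric-threshold gate in Python. The metric names, thresholds, and results-file layout are illustrative assumptions, not taken from any one project; in practice the thresholds and metrics were defined per evaluation suite.

```python
# Minimal sketch of a metric-threshold gate for a CI-style eval step.
# Metric names, thresholds, and the results-file format are illustrative.
import json
import sys

THRESHOLDS = {
    "accuracy": 0.90,               # minimum acceptable score
    "safety_pass_rate": 0.99,
    "instruction_following": 0.85,
}

def gate(results_path: str) -> int:
    with open(results_path) as f:
        metrics = json.load(f)      # e.g. {"accuracy": 0.93, ...}

    failures = [
        f"{name}: {metrics.get(name, 0.0):.3f} < {minimum:.3f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0     # nonzero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```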
2025
Regression Testing & Performance Drift Detection for AI and LLM Systems
I conducted regression testing for ML and LLM-based systems to ensure new model versions or prompt updates did not degrade previously correct behaviors. My work involved defining baseline (golden) test sets, running repeated evaluations across model iterations, and comparing results to detect regressions in correctness, safety, and instruction-following.
I tracked changes in key metrics (pass/fail rates, weighted scores, error categories), identified regression patterns, and documented root causes such as prompt changes, tool-use failures, or logic drift. Findings were used to gate releases, refine evaluation coverage, and prioritize fixes for high-impact regressions.
This process emphasized reproducibility, consistent scoring logic, and clear reporting to support reliable model iteration and continuous improvement.
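The core comparison behind this kind of regression check can be sketched in a few lines of Python. The case IDs and the boolean pass/fail result format below are hypothetical simplifications; real suites also tracked weighted scores and error categories.

```python
# Sketch of a baseline-vs-candidate comparison over a golden test set.
# Case IDs and the boolean pass/fail format are hypothetical.
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed on the baseline but fail on the candidate."""
    return [case_id for case_id, passed in baseline.items()
            if passed and not candidate.get(case_id, False)]

baseline = {"case-001": True, "case-002": True, "case-003": False}
candidate = {"case-001": True, "case-002": False, "case-003": True}

regressions = find_regressions(baseline, candidate)
print(f"{len(regressions)} regression(s): {regressions}")  # -> ['case-002']
```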
2025
Retrieval-Augmented Generation (RAG) Evaluation & QA for LLM Systems
I worked on evaluation and QA workflows for Retrieval-Augmented Generation (RAG) systems, focusing on assessing the full pipeline from document retrieval to final answer generation. My responsibilities included evaluating retrieval quality (relevance, coverage, ranking) and generation quality (correctness, faithfulness to sources, completeness).
I reviewed whether generated answers were grounded in retrieved documents, identified hallucinations or unsupported claims, and flagged missing or weak evidence. I also evaluated citation alignment, answer usefulness, and failure modes such as partial retrieval, irrelevant context injection, or over-reliance on incorrect sources.
Structured scoring and clear justification were applied to provide reliable feedback for model tuning, benchmark evaluation, and continuous improvement of RAG pipelines.
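To make the groundedness judgment concrete, here is a deliberately naive lexical check in Python. It is a stand-in for the human "is this claim supported by the retrieved documents?" review described above, not the method actually used; the stopword list and example documents are illustrative.

```python
# Naive lexical groundedness heuristic, a simplified stand-in for the
# rubric-based human grounding review described above.
import re

def support_score(sentence: str, documents: list[str]) -> float:
    """Fraction of a sentence's content words found in any retrieved document."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    words -= {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
    if not words:
        return 1.0
    corpus = " ".join(documents).lower()
    return sum(w in corpus for w in words) / len(words)

docs = ["The Nile is the longest river in Africa."]
print(support_score("The Nile is the longest river in Africa.", docs))  # 1.0, grounded
print(support_score("The Nile freezes every winter.", docs))            # 0.25, flag it
```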
2024
Search Relevance & Ranking Evaluation for AI and LLM Systems
I worked on search relevance and ranking evaluation tasks for AI-powered search and retrieval systems. My responsibilities included assessing how well retrieved documents, passages, or generated answers matched user queries, and assigning graded relevance scores based on predefined guidelines.
I evaluated outputs across dimensions such as topical relevance, completeness, factual correctness, and usefulness, and identified failure cases including partial relevance, off-topic results, and hallucinated content. For ranked results, I reviewed ordering quality and documented ranking errors where more relevant items were placed lower.
This work supported improving retrieval quality for LLM-powered search and RAG systems, providing reliable human feedback signals for model tuning and evaluation.
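One standard way to quantify the ordering errors described above is NDCG over graded relevance labels. The sketch below uses a linear-gain DCG variant on a 0-3 relevance scale; the label values are made up for illustration.

```python
# Sketch of NDCG over graded relevance labels (0-3 scale), one common way
# to quantify ranking errors. Labels below are illustrative.
import math

def dcg(relevances: list[int]) -> float:
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances: list[int]) -> float:
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels for a ranked result list: a highly relevant item sits at rank 3.
labels = [1, 0, 3, 2]
print(f"NDCG = {ndcg(labels):.3f}")  # < 1.0 signals a ranking error
```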
2024
Audio, Speech & ASR Data Labeling and Quality Evaluation for AI Systems
I worked on audio and speech data labeling projects supporting automatic speech recognition (ASR) and conversational AI systems. My responsibilities included transcribing spoken audio into accurate text, reviewing ASR outputs, and correcting errors related to pronunciation, accents, background noise, and overlapping speech.
I also evaluated audio and transcription quality by identifying misrecognitions, segmentation issues, and timing inconsistencies, and applied structured ratings to assess transcription accuracy and usability. Tasks required strict adherence to transcription guidelines, handling diverse audio conditions, and maintaining consistency across large datasets.
In addition, I labeled and reviewed emotional tone and speaker intent in speech samples to support downstream tasks such as emotion recognition and dialog quality evaluation.
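The transcription accuracy ratings mentioned above typically rest on word error rate (WER). Below is a minimal self-contained Python sketch of WER via word-level edit distance; the example sentences are illustrative.

```python
# Sketch of word error rate (WER), the standard metric behind transcription
# accuracy scoring. Example sentences are illustrative.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn the light on"))  # 0.25 (one substitution)
```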
2024
Education
Helwan University
Bachelor of Science, Physical and Natural Sciences (Microbiology and Biochemistry)
2023 - 2025