
Wasil Hassan Baig

AI Trainer & Expert in Data Labeling

Seattle, USA
$16.00/hr · Expert · Appen · Argilla · Data Annotation Tech

Key Skills

Software

Appen
Argilla
Data Annotation Tech
Mindrift
Remotasks
Telus
Other
Scale AI
Internal/Proprietary Tooling

Top Subject Matter

No subject matter listed

Top Data Types

Audio
Image
Text

Top Task Types

Action Recognition
Classification
Data Collection
Diagnosis
Fine Tuning

Freelancer Overview

I am an experienced AI Trainer and Data Labeling Specialist with a strong background in annotating and curating high-quality training data for machine learning models. My expertise spans natural language processing (NLP), audio annotation, and data labeling. I have worked on projects involving LLM evaluation, sentiment analysis, intent classification, and linguistic annotation across various platforms. In audio tasks, I’ve contributed to training speech recognition systems by labeling voice characteristics such as accent, age, emotion, and content summaries. With a solid foundation in data management and analysis, I bring attention to detail, consistency, and a deep understanding of how quality data drives AI performance.

English (Expert)

Labeling Experience

Prompt Quality Assessment for Factuality Benchmark – AI Evaluation

Internal Proprietary Tooling · Text · Segmentation · Classification
This project involved evaluating the factual quality and suitability of user-generated prompts intended for benchmarking AI model accuracy. I reviewed and categorized prompts based on a detailed taxonomy of errors, including hallucinations, hypotheticals, non-factual requests, non-answerable queries, safety violations, ambiguity, and subjectivity. The task required distinguishing between prompts to reject outright and those that could be revised while ensuring alignment with project goals—focusing only on prompts that seek factual information. The role demanded critical thinking, linguistic analysis, and careful judgment to help build a high-quality dataset for factual AI evaluation.


2025 - 2025

Language and Dialect Evaluation – Model Comparison for LID Accuracy

Internal Proprietary Tooling · Audio · Classification · Diagnosis
In this project, I evaluated the performance of two language identification (LID) models by comparing their dialect predictions for short text samples. Using provided metadata and contextual references, I assessed whether each model correctly identified the dialect and language of the “cleaned text.” My task involved assigning final labels for the correct dialect and language and selecting the appropriate evaluation outcome for each model's prediction using dropdown fields with built-in data validation. The role required strong multilingual proficiency (especially in English and the V1 dialect), attention to linguistic nuance, and adherence to updated annotation guidelines, including special handling of named entities considered monolingual.


2025 - 2025

Pairwise Audio Output Evaluation – Span-Prompted Sound Extraction

Internal Proprietary Tooling · Audio · Segmentation · Classification
This project focused on evaluating machine learning models designed to isolate specific sound events from audio mixtures based on predefined time spans. Using the SRT Halo Tool, I was responsible for performing pairwise subjective evaluations of model outputs by comparing two extracted audio clips against a visualized input (spectrogram and energy plot). The target sound was identified by yellow-highlighted time spans, and outputs were judged based on their ability to isolate that sound accurately while excluding non-target audio. The role required strong auditory discrimination, attention to temporal markers, and understanding of audio spectrograms to assess output quality in line with strict guidelines.


2025 - 2025

Procedural Video Sequence Annotation – WorldPrediction Benchmark

Internal Proprietary Tooling · Video · Segmentation · Relationship
This project focused on evaluating AI understanding of procedural planning through visual reasoning. Using the Squirrel platform, I was tasked with analyzing a set of 3 to 10 short video clips paired with an initial and final image. The goal was to determine which one of four possible video sequences accurately depicted the logical progression of actions transforming the initial state into the final state. Each correct sequence required careful attention to the actions performed, disregarding visual inconsistencies in the background or setting. The task demanded strong visual analysis skills, attention to detail, and an understanding of causality in everyday tasks to ensure accurate sequence predictions.


2025 - 2025

Video Physics Plausibility Rating – Intuitive Physics Benchmark

Internal Proprietary Tooling · Video · Entity/NER Classification · Object Detection
This project aimed to establish a human baseline for evaluating the physical plausibility of simulated videos, supporting AI development in intuitive physics reasoning. Using the Squirrel annotation platform, I watched short videos generated by physics simulation engines and rated the realism of object interactions. Each video was assessed based on how plausible the motion and behavior of objects were in a real-world context, selecting from four graded options ranging from "Completely Implausible" to "Completely Plausible." The task required high attention to detail, critical thinking, and consistency in judgment across hundreds of 10-second clips.


2025 - 2025

Education


Bellevue College

Bachelor of Applied Science, Data Management & Analysis, Data Analytics

2020 - 2024

Brainnest

Industry Training Program, Data Analysis

2023 - 2023

Work History


RWS Group

Data Annotator

Remote
2024 - Present

Women in Localization

Data Analyst, Operations & Metrics

California
2024 - 2025