LLM Evaluation, Coding Tasks, and Multilingual AI Annotation & Transcription
Kolkata, India
$10.00/hr · Intermediate · CVAT · Doccano · Labelbox
Key Skills
Software
CVAT
Doccano
Labelbox
Prodigy
Appen
Data Annotation Tech
Scale AI
Top Subject Matter
No subject matter listed
Top Data Types
Computer Code Programming
Image
Text
Top Task Types
Bounding Box
Classification
Computer Programming/Coding
Evaluation/Rating
Translation/Localization
Freelancer Overview
An experienced Multilingual AI Data Annotator and LLM Evaluation Specialist with over three years of expertise creating, curating, and validating high-quality training data for NLP, speech-recognition, and STEM learning systems. I have labeled and vetted 120,000+ sentences in English, Hindi, and Bengali using Labelbox and Prodigy, and developed Python scripts and ETL pipelines to automate QA checks and debug data flows, boosting annotation consistency by 30%. I design and execute evaluation protocols for cutting-edge LLMs, crafting prompt-based test suites and scoring outputs against accuracy, bias, and relevance metrics, and as a transcription editor I deliver 98%+ word-accuracy transcripts for podcasts, interviews, and focus groups. I author comprehensive annotation guidelines and conduct code reviews in Python, JavaScript, SQL, and Bash. My system-administration expertise spans Linux servers, Docker containers, and AWS infrastructure, and as a software and web development specialist I architect scalable applications with React, Node.js, and Next.js and write complex SQL queries. I also develop and deliver AI training programs in coding, computer science, and STEM, designing hands-on modules and interactive assessments. Passionate about empowering AI with reliable, multilingual data, I thrive in fast-paced remote settings, continually refining processes, optimizing scripts, and driving measurable gains in model performance.
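The automated QA checks mentioned above can take many forms; a minimal sketch of one, assuming a hypothetical export format with `text` and `label` fields (real Labelbox/Prodigy exports differ by project), flags rows with empty text or labels outside the agreed label set:

```python
# Minimal annotation QA sketch (hypothetical field names and label set;
# actual export schemas vary by tool and project).
ALLOWED_LABELS = {"positive", "negative", "neutral"}  # example label set

def qa_check(rows):
    """Return (index, reason) pairs for annotation rows that fail QA."""
    issues = []
    for i, row in enumerate(rows):
        text = row.get("text", "").strip()
        label = row.get("label")
        if not text:
            issues.append((i, "empty text"))
        if label not in ALLOWED_LABELS:
            issues.append((i, f"unknown label: {label!r}"))
    return issues

rows = [
    {"text": "Great product", "label": "positive"},
    {"text": "", "label": "neutral"},
    {"text": "Meh", "label": "mixed"},
]
print(qa_check(rows))  # → [(1, 'empty text'), (2, "unknown label: 'mixed'")]
```

Running a check like this on every export batch catches schema drift and labeler mistakes before they reach training data.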
Intermediate · English · Hindi · Bengali
Labeling Experience
RLHF Code Annotation & Review
Data Annotation Tech · Computer Code Programming · Classification · Computer Programming/Coding
I reviewed 500+ Python, JavaScript, and C++ snippets plus their natural-language instructions for an instruction-following LLM. Tasks included classifying common bug types, flagging insecure patterns, writing cleaner reference solutions, and scoring model outputs for correctness, readability, and style. My annotations fed directly into RLHF fine-tuning and improved compile-success rates by 22%. I maintained a 99% audit-pass rate and provided periodic rubric feedback to tooling engineers.
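A compile-success metric implies an automated gate that checks whether snippets parse at all. One way such a gate could work for Python, sketched here with the built-in `compile()` (the project's actual tooling is not shown in this profile):

```python
# Sketch: flag Python snippets that fail to compile, the kind of
# automated correctness check behind a compile-success metric.
def compiles_ok(snippet: str) -> bool:
    """Return True if the snippet parses/compiles as Python source."""
    try:
        compile(snippet, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon
print(compiles_ok(good), compiles_ok(bad))  # → True False
```

Equivalent gates exist for other languages (e.g. invoking `node --check` or a C++ compiler in syntax-only mode) before human reviewers score correctness and style.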
2023
Text Data Labeling Specialist
Scale AI · Text · Entity/NER Classification · Text Generation
Worked on multiple projects focused on text data annotation, including content moderation, intent classification, sentiment analysis, and entity extraction. Responsible for reviewing, labeling, and categorizing large volumes of text data to train and validate machine learning models. Ensured high accuracy by following detailed guidelines and maintaining consistency across tasks. Collaborated with project managers and QA teams to improve labeling standards and deliver quality datasets within tight deadlines.
2024 - 2025
Audio Data Labeler
Appen · Audio · Audio Recording
Marked speaker turns and non-speech sounds (e.g., laughter, coughs) with precise timestamps on raw audio clips to ensure clean segmentation for downstream tasks.
Reviewed and corrected machine-generated transcripts against original recordings, fixing misheard words and aligning text to timecodes per project style guide.
Collaborated with the annotation lead to refine labeling instructions, boosting consistency across the dataset and reducing revision cycles.
2024 - 2024
Financial Document OCR and Key-Value Annotation
Appen · Document · Entity/NER Classification · Classification
Extracted key fields (invoice number, date, vendor, line-item totals) from 15,000+ PDF invoices. Built a dual-review workflow with automated regex checks that drove data accuracy to 97% and cut manual QC time by 40%.
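The automated regex checks in a workflow like this validate extracted field values against expected formats before human review. A stdlib-only sketch, assuming hypothetical field names and patterns (real invoice formats vary by vendor and project spec):

```python
import re

# Hypothetical per-field format checks; real patterns depend on the
# client's invoice formats and the project's annotation spec.
CHECKS = {
    "invoice_number": re.compile(r"INV-\d{4,}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),   # ISO date
    "total": re.compile(r"\d+\.\d{2}"),          # e.g. 199.99
}

def validate_record(record):
    """Return field names whose extracted value fails its pattern."""
    return [field for field, pattern in CHECKS.items()
            if not pattern.fullmatch(str(record.get(field, "")))]

rec = {"invoice_number": "INV-00123", "date": "2023-07-14", "total": "199.9"}
print(validate_record(rec))  # → ['total']  (only one decimal digit)
```

Records that pass every check can skip one of the two review passes, which is how a dual-review workflow cuts manual QC time without sacrificing accuracy.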
2023 - 2023
Urban Infrastructure Mapping
CVAT · Geospatial Tiled Imagery · Polygon · Segmentation
Annotated 5,000 km² of tiled satellite imagery to outline road networks, building footprints, and green spaces. Built an automated tiling pipeline to evenly distribute work and a QA script that checked polygon overlaps, boosting annotation throughput by 40% while maintaining 98% mean IoU.
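The polygon-overlap QA script itself is not shown; a stdlib-only sketch of the idea uses axis-aligned bounding boxes as a cheap pre-check for candidate overlaps (a production pipeline would likely follow up with exact polygon intersection, e.g. via shapely):

```python
# Sketch: cheap overlap pre-check for annotation polygons using
# axis-aligned bounding boxes; polygons are lists of (x, y) vertices.
def bbox(poly):
    """Axis-aligned bounding box (minx, miny, maxx, maxy) of a polygon."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)

def bboxes_overlap(a, b):
    """True if the bounding boxes of polygons a and b intersect."""
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def overlapping_pairs(polygons):
    """Index pairs of polygons whose bounding boxes intersect."""
    return [(i, j)
            for i in range(len(polygons))
            for j in range(i + 1, len(polygons))
            if bboxes_overlap(polygons[i], polygons[j])]

road = [(0, 0), (4, 0), (4, 1), (0, 1)]
building = [(3, 0.5), (5, 0.5), (5, 2), (3, 2)]
park = [(10, 10), (12, 10), (12, 12)]
print(overlapping_pairs([road, building, park]))  # → [(0, 1)]
```

Flagged pairs go back to annotators for correction, which is what keeps overlap errors out of the final segmentation masks.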