
Samuel Wanjiru

ML Engineer – LLM Dataset Curation and Fine-tuning for Code Generation

Nairobi, Kenya
$50.00/hr · Expert

Key Skills

Software

Axiom AI
Data Annotation Tech
Encord
Lionbridge
Mercor
Micro1

Top Subject Matter

Audio DSP
Large Language Models
Code Generation

Top Data Types

Audio
Text
Document

Top Task Types

Prompt Response Writing SFT
Data Collection
RLHF
Fine Tuning

Freelancer Overview

ML Engineer – LLM Dataset Curation and Fine-tuning for Code Generation. Brings 4+ years of professional experience across legal operations, contract review, compliance, and structured analysis. Core strengths include Internal and Proprietary Tooling. AI-training focus includes data types such as Computer Code, Programming, and Text and labeling workflows including Fine-tuning and Data Collection.

English – Expert

Labeling Experience

Open-Source Data Labeling Contributor (Distilabel)

Data Collection
I contributed to open-source data labeling tooling through active participation in the Distilabel project. I enhanced dataset curation capabilities to improve LLM data workflows. These improvements strengthened the quality and reliability of domain-specific training datasets.
• Submitted multiple PRs focused on data extraction and dataset curation functions.
• Designed and tested enhancements to filtering algorithms in an open-source context.
• Supported documentation and usability for new annotation and QA tools.
• Evaluated best practices in open-source data labeling.

2024 - Present
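As a rough illustration of the curation-and-filtering pass described above, a minimal quality filter might look like the following. This is a plain-Python sketch, not Distilabel's actual API; the field names and length thresholds are assumptions for demonstration.

```python
# Illustrative sketch of a dataset-curation filter pass (not Distilabel's
# actual API; the heuristics and field names below are assumptions).

def passes_quality_filters(record, min_len=20, max_len=4000):
    """Apply simple completeness and length checks to one record."""
    if not record.get("instruction") or not record.get("response"):
        return False
    text = record["instruction"] + record["response"]
    return min_len <= len(text) <= max_len

def curate(records):
    """Keep only records that pass every filter, preserving order."""
    return [r for r in records if passes_quality_filters(r)]

raw = [
    {"instruction": "Write a low-pass filter in C++",
     "response": "float lp(...) { ... }"},
    {"instruction": "", "response": "orphan response"},
]
clean = curate(raw)  # only the complete record survives
```

In a real pipeline the filter list would also include language detection, license checks, and model-based scoring; the structure (per-record predicates composed into one pass) stays the same.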

ML Engineer – RLHF for Code Models

RLHF
I implemented reinforcement learning from human feedback (RLHF) strategies to facilitate LLM alignment for code generation tasks in audio DSP domains. This involved generating human preference data and using reward models for iterative model improvement. The process enhanced the domain specificity and accuracy of LLM outputs for specialized programming applications.
• Collected and labeled human responses and preferences for code model outputs.
• Applied iterative RLHF fine-tuning pipelines on Qwen3-Coder and CodeLlama models.
• Used open-source LLMs to validate reward model performance.
• Developed documentation for RLHF pipeline best practices and reproducibility.

2024 - Present
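The preference-labeling and reward-model loop described above rests on the pairwise Bradley-Terry objective commonly used in RLHF reward modeling. The sketch below shows that objective in isolation; the reward scores are placeholder numbers, not outputs of any real model.

```python
import math

# Bradley-Terry pairwise objective used when fitting a reward model to
# human preference labels: the model should assign the human-chosen
# response a higher score than the rejected one. Scores here are
# placeholders, not real reward-model outputs.

def preference_probability(reward_chosen, reward_rejected):
    """P(chosen beats rejected) under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human preference label."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# A reward model that ranks the chosen sample higher incurs a low loss.
good = pairwise_loss(2.0, 0.5)  # chosen scored above rejected
bad = pairwise_loss(0.5, 2.0)   # chosen scored below rejected
```

Minimizing this loss over many labeled pairs is what turns raw human preferences into a scalar reward signal for the fine-tuning stage.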

ML Engineer – Dataset & Fine-tuning

Prompt Response Writing SFT · Fine Tuning
I developed supervised fine-tuning datasets specializing in code instruction and response pairs for LLMs. My work focused on assembling, cleaning, and validating training data for specialized code models through curation of high-quality instruction-response pairs. I implemented automated quality filtering and LLM-based validation to ensure dataset reliability and optimization for fine-tuning tasks.
• Curated, extracted, and labeled 2,000+ instruction-response pairs from open-source code repositories and technical documentation.
• Used Distilabel and internal LLM tools for synthetic data generation and validation.
• Applied semantic deduplication and BLEU-score based multi-pass filtering.
• Established and documented best practices for domain-specific data curation and validation.

2024 - Present
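The deduplication step mentioned above can be sketched as follows. True semantic deduplication would compare embedding vectors; token-shingle Jaccard similarity stands in here as a lightweight assumption, and the threshold is illustrative.

```python
# Sketch of near-duplicate removal over instruction-response pairs.
# Real "semantic" deduplication would use embeddings; token-shingle
# Jaccard similarity is a cheap stand-in for illustration.

def shingles(text, n=3):
    """Set of n-token shingles for a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(pairs, threshold=0.8):
    """Greedily keep a pair only if it is unlike every pair kept so far."""
    kept, kept_shingles = [], []
    for pair in pairs:
        s = shingles(pair["instruction"] + " " + pair["response"])
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(pair)
            kept_shingles.append(s)
    return kept

samples = [
    {"instruction": "Implement a biquad filter",
     "response": "Use processBlock to apply coefficients."},
    {"instruction": "Implement a biquad filter",
     "response": "Use processBlock to apply coefficients."},
    {"instruction": "Explain an FFT",
     "response": "Decompose the signal into frequency bins."},
]
unique = deduplicate(samples)  # the exact duplicate is dropped
```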

Graduate Research Assistant – AI & NLP

Text · Data Collection
Led a team extracting and structuring Q&A pairs on domain-specific audio DSP topics for research and LLM application. Developed data cleaning pipelines, validated data quality, and generated synthetic code Q&A pairs leveraging LLMs. Published and documented structured datasets for precision code classification and AI training.
• Extracted and structured over 3,000 Q&A text-code pairs from Stack Overflow and GitHub Gists.
• Applied semantic deduplication, code syntax validation, and LLM-powered synthetic generation.
• Exported labeled datasets in JSON format for use in training and evaluation tasks.
• Attained 92% precision on dataset used for code classification research.

2023 - 2024
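A minimal sketch of the validate-then-export step described above. The curated data targeted audio DSP code (largely C++); Python's `ast` module stands in here for the syntax validator, and the field names are hypothetical.

```python
import ast
import json

# Sketch of syntax validation plus JSON export for Q&A text-code pairs.
# ast.parse is a stand-in validator (the real data was largely C++);
# field names are hypothetical.

def is_valid_python(code):
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def export_pairs(pairs, path):
    """Write only syntactically valid pairs to a JSON file; return the count."""
    valid = [p for p in pairs if is_valid_python(p["code"])]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(valid, fh, indent=2)
    return len(valid)

qa_pairs = [
    {"question": "Sum a list?", "code": "def total(xs):\n    return sum(xs)"},
    {"question": "Broken sample", "code": "def oops(:"},
]
```

Gating the export on a parser pass is what keeps malformed snippets out of the training set before any model-based quality scoring runs.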

Audio DSP Code Generation Fine-tuning Dataset (Master’s Thesis)

Data Collection
I curated a domain-specific code generation training dataset by sourcing, extracting, and preparing labeled instruction-response examples from multiple open-source repositories. The focus was on creating high-quality pairs applicable to audio DSP model fine-tuning tasks. Thorough validation and deduplication algorithms were applied at multiple curation stages.
• Collected 5,000+ code samples and engineered synthetic data with targeted prompts via the Claude API.
• Implemented staged quality filtering using LLM-based validation, BLEU scoring, and semantic deduplication.
• Focused on audio DSP code patterns such as processBlock, AudioBuffer, and filter implementation.
• Achieved improved model accuracy for DSP code generation benchmarks presented at MLOps Africa 2024.

2023 - 2023
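The BLEU-based filtering mentioned above can be approximated with a single-order modified n-gram precision. This simplified sketch omits full BLEU's brevity penalty and multi-order geometric mean; it only shows the core overlap measure used to gate generated samples against a reference.

```python
from collections import Counter

# Modified n-gram precision in the spirit of BLEU, used as a rough
# similarity gate between a generated sample and a reference. This is a
# single-order simplification: no brevity penalty, no geometric mean
# over n-gram orders.

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of candidate against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

exact = ngram_precision("apply the biquad per sample",
                        "apply the biquad per sample")
disjoint = ngram_precision("normalize the gain", "compute an fft")
```

In a staged pipeline such a score is typically one of several gates, applied after cheap length checks and before more expensive LLM-based validation.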

Education


University of Nairobi

Master of Science, Machine Learning and Audio Signal Processing

2022 - 2024

Kenyatta University

Bachelor of Science, Computer Science

2018 - 2022

Work History


Independent Contractor

ML Engineer – Web Scraping and Dataset Curation

Nairobi
2024 - Present

Independent Contractor

ML Engineer – Dataset & Fine-tuning

Nairobi
2024 - Present