Instruction-Tuning Dataset for Code Generation Models
The scope of this project was to create a high-quality instruction-tuning dataset to improve the code-generation capabilities of a large language model.
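An instruction-tuning dataset of this kind typically stores one JSON object per line, pairing a natural-language instruction with a reference solution. The sketch below is purely illustrative; the field names and schema are assumptions, not the project's actual format.

```python
import json

# Hypothetical schema for a single instruction-tuning record; the
# project's real fields are not specified in this description.
record = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",  # optional extra context, empty here
    "response": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}

# Serialize as one JSON object per line (JSONL), a common format
# for instruction-tuning corpora.
line = json.dumps(record)
print(line)
```

Keeping each record on its own line makes the dataset easy to stream, shuffle, and filter during training.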
My expertise lies at the intersection of data labeling and software engineering, with a specialized focus on creating high-quality training data for code intelligence and large language models. I am proficient in the full pipeline, from designing annotation guidelines and managing labeling teams to building custom tools with Python, AST parsers, and libraries like Tree-sitter to automate and scale the data generation process. My work is grounded in a data-centric AI philosophy, aiming to systematically improve model performance through meticulously curated datasets.

What sets me apart is my deep technical ability to understand and label complex, structured data like source code. I don't just annotate text; I engineer datasets for specific learning objectives, such as code summarization, bug detection, and program synthesis.

My research and projects, including developing novel labeling pipelines for instruction-following data and improving code search via contrastive learning, demonstrate a proven track record of creating data that directly enhances model capabilities on challenging, real-world tasks.
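The AST-based tooling mentioned above can be sketched with Python's standard-library `ast` module: walking a parse tree to turn documented functions into instruction/response pairs. This is a minimal illustrative example, not the actual pipeline; the instruction template and sample source are assumptions.

```python
import ast
import json

# Hypothetical sample source; a real pipeline would read files
# from a code corpus instead.
SOURCE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

def extract_pairs(source: str) -> list[dict]:
    """Walk the AST and turn each documented function into an
    instruction/response pair for an instruction-tuning dataset."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:
                pairs.append({
                    # Use the docstring as the task description and the
                    # function's source text as the reference response.
                    "instruction": (
                        f"Write a Python function `{node.name}`. {doc}"
                    ),
                    "response": ast.get_source_segment(source, node),
                })
    return pairs

for record in extract_pairs(SOURCE):
    print(json.dumps(record))
```

Because the pairs are derived mechanically from the code itself, this kind of extraction scales to large corpora, with human annotators reviewing and refining the generated instructions rather than writing every example from scratch.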
Master of Science in Computer Science
AI Engineering Intern