Web Scraping & Data Engineering - Data Preparation for LLM Training
Cleaned and structured raw web data into high-quality datasets for LLM fine-tuning. Developed automated Python scripts for large-scale data extraction, focused on producing text suitable for AI model training. Collaborated with clients to identify key data requirements and ensure adherence to quality standards.
• Automated data collection using Python, Selenium, and Playwright.
• Delivered cleaned JSON/CSV datasets for AI workflows.
• Focused on optimizing dataset structure for LLM performance.
• Troubleshot data issues to meet modeling needs.