Data Processing and Text Classification for Digitization (IIT Bombay Internship)
Built and automated OCR workflows to extract, clean, and structure text from large volumes of scanned historical manuscripts. Applied data preprocessing and normalization to create datasets suitable for downstream AI applications. Validated and processed extracted content to ensure high accuracy for research purposes.
• Created machine-readable datasets from noisy document sources
• Automated pipeline for text digitization using Python-based tools
• Performed quality checks on labeled data to meet research standards
• Enabled search and retrieval on digitized content