Environmental HRMS Data Annotation & AI-Ready Dataset Development
Led the annotation, structuring, and validation of large-scale environmental datasets generated by high-resolution mass spectrometry (HRMS) across soil, wastewater, and agricultural systems. Transformed raw, noisy analytical outputs into AI-ready datasets by systematically labeling chemical entities, classifying compounds, and assigning identification confidence levels (e.g., following the Schymanski framework). Developed Python-based workflows (pandas, NumPy, scikit-learn) to standardise multi-source data, remove inconsistencies, and ensure reproducibility. Implemented quality control pipelines covering tolerance thresholds (e.g., mass accuracy filtering), duplicate handling, and cross-dataset validation. Annotated more than 500 contaminants across multiple environmental matrices, integrating metadata such as physicochemical properties and detection frequencies. A key component of the work was aligning measured environmental concentrations (MECs) with model-predicted environmental concentrations (PECs), which required careful feature engineering, edge-case handling, and validation logic, and directly supported downstream machine learning applications. All workflows were designed for traceability and auditability, consistent with GLP-aligned environments.
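As an illustration of the mass accuracy filtering and duplicate handling mentioned above, the following is a minimal sketch rather than the project's actual pipeline. It is written in plain Python so it runs without dependencies; the same steps map directly onto pandas operations. The field names, the 5 ppm tolerance, the keep-the-lowest-error rule for duplicates, and the example values are all assumptions made for illustration.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def filter_and_deduplicate(records, tolerance_ppm=5.0):
    """Drop annotations outside the ppm tolerance, then keep one record
    per (compound, sample_id) pair: the one with the smallest absolute error.

    The 5 ppm default and the duplicate rule are illustrative assumptions.
    """
    best = {}
    for rec in records:
        err = ppm_error(rec["observed_mz"], rec["theoretical_mz"])
        if abs(err) > tolerance_ppm:
            continue  # fails the mass accuracy check
        key = (rec["compound"], rec["sample_id"])
        if key not in best or abs(err) < abs(best[key]["ppm_error"]):
            best[key] = {**rec, "ppm_error": err}
    return list(best.values())


# Hypothetical example records (values invented for illustration only).
records = [
    {"compound": "carbamazepine", "sample_id": "WW-01",
     "observed_mz": 237.1025, "theoretical_mz": 237.1022},
    {"compound": "carbamazepine", "sample_id": "WW-01",
     "observed_mz": 237.1030, "theoretical_mz": 237.1022},  # duplicate, larger error
    {"compound": "carbamazepine", "sample_id": "WW-01",
     "observed_mz": 237.1060, "theoretical_mz": 237.1022},  # outside tolerance
    {"compound": "atrazine", "sample_id": "SOIL-01",
     "observed_mz": 216.1040, "theoretical_mz": 216.1011},  # outside tolerance
]

kept = filter_and_deduplicate(records)
for rec in kept:
    print(rec["compound"], rec["sample_id"], round(rec["ppm_error"], 2))
```

Storing the computed `ppm_error` on each retained record keeps the basis for every filtering decision visible in the output, which suits the traceability goal described above.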