Document Collection and Annotation Specialist
I worked on building a high-quality ground-truth dataset from real-world financial documents (PDF, XLS, and images). The project involved collecting and anonymizing ~10 heterogeneous client documents (account statements, portfolio summaries, loan/credit documents), then extracting key financial information such as assets, liabilities, account/holding details, balances, currencies, dates, and client identifiers. Using an internal annotation tool and a provided JSON schema, I converted the unstructured documents into clean, structured JSON, enforcing consistency across files (field naming, units, currency codes, date formats). I implemented quality checks with custom Python scripts (JSON schema validation, detection of missing or invalid fields) and maintained a tag/notes system to explicitly flag ambiguous, incomplete, or conflicting information. The final output was a set of document + JSON “correction” pairs designed to benchmark and regression-test LLMs.
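The quality checks can be sketched as follows. This is a minimal stdlib-only illustration, not the actual internal tooling: the field names, the expected date format, and the `check_record` helper are hypothetical stand-ins for the richer provided schema, and the real scripts validated many more fields.

```python
import re
from datetime import datetime

# Hypothetical subset of the schema; the real field set was much larger
# (assets, liabilities, holdings, client identifiers, etc.).
REQUIRED = {
    "client_id": str,
    "currency": str,
    "balance": (int, float),
    "statement_date": str,
}
CURRENCY_RE = re.compile(r"^[A-Z]{3}$")  # ISO 4217 alpha-3 codes, e.g. "CHF"

def check_record(record: dict) -> list[str]:
    """Return human-readable issues; an empty list means the record is clean."""
    issues = []
    # Missing-field and wrong-type detection against the schema.
    for field, typ in REQUIRED.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            issues.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Value-level checks: currency code format and canonical date format.
    cur = record.get("currency")
    if isinstance(cur, str) and not CURRENCY_RE.match(cur):
        issues.append(f"invalid currency code: {cur!r}")
    date = record.get("statement_date")
    if isinstance(date, str):
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            issues.append(f"invalid date (expected YYYY-MM-DD): {date!r}")
    return issues
```

In practice a record that failed any of these checks was either corrected or flagged in the tag/notes system rather than silently dropped, so ambiguities stayed visible to reviewers.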