Global Speech Data Collection & Annotation for Multi-Language Conversational AI
Currently executing a large-scale, multi-language data collection and annotation project for a leading cloud provider (AWS) to expand its Conversational AI capabilities across 30+ global languages. The target is 40,000 high-quality speech utterances per language. Our team manages the end-to-end global workflow:
Global Recruitment: Sourcing, vetting, and managing a diverse network of native speakers across dozens of countries to ensure authentic dialectal and demographic representation.
Multi-Language Data Collection: Coordinating the creation and recording of utterances across 10 Inverse Text Normalization (ITN) categories (e.g., currencies, dates, addresses).
Localized Annotation: Deploying language-specific teams to produce two annotation layers for each audio file: a precise verbatim transcription and a fully normalized ITN version.
Centralized Quality Assurance: We have implemented a scalable QA framework