Audio annotation
The project involves labeling three kinds of audio: background noise, assistant speech, and user speech. Each word (token) must be timestamped with a temporal precision of 1–5 milliseconds. In addition to labeling, we also need to perform judgment tasks: classifying each instance of user speech as an interruption, a standard response, or an acknowledgment. This classification is required regardless of whether we transcribe the speech verbatim.
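To make the requirements concrete, one way an annotation record could be structured is sketched below. This is only an illustration, not the project's actual schema: the class names (`Source`, `TurnType`, `Token`, `Segment`), the field names, and the choice to store timestamps as integer milliseconds are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Source(Enum):
    """The three audio categories named in the project description."""
    BACKGROUND_NOISE = "background_noise"
    ASSISTANT_SPEECH = "assistant_speech"
    USER_SPEECH = "user_speech"


class TurnType(Enum):
    """Judgment label for a stretch of user speech."""
    INTERRUPTION = "interruption"
    STANDARD_RESPONSE = "standard_response"
    ACKNOWLEDGMENT = "acknowledgment"


@dataclass
class Token:
    """One word with its time span (hypothetical layout)."""
    text: str
    start_ms: int  # assumption: integer milliseconds from the start of the audio
    end_ms: int

    def __post_init__(self) -> None:
        if self.start_ms < 0 or self.end_ms < self.start_ms:
            raise ValueError("token times must satisfy 0 <= start_ms <= end_ms")


@dataclass
class Segment:
    """A labeled stretch of audio, optionally with word-level tokens."""
    source: Source
    start_ms: int
    end_ms: int
    # Tokens may be empty: the judgment label does not depend on a transcript.
    tokens: List[Token] = field(default_factory=list)
    turn_type: Optional[TurnType] = None

    def __post_init__(self) -> None:
        if self.end_ms < self.start_ms:
            raise ValueError("segment end_ms must not precede start_ms")
        if self.turn_type is not None and self.source is not Source.USER_SPEECH:
            raise ValueError("turn_type applies only to user speech")
        for token in self.tokens:
            if token.start_ms < self.start_ms or token.end_ms > self.end_ms:
                raise ValueError("token lies outside its segment")


# Example: a short acknowledgment from the user, transcribed as one token.
ack = Segment(
    source=Source.USER_SPEECH,
    start_ms=1200,
    end_ms=1650,
    tokens=[Token("okay", 1200, 1650)],
    turn_type=TurnType.ACKNOWLEDGMENT,
)
```

Keeping `turn_type` separate from `tokens` lets a segment carry a judgment label even when no verbatim transcript exists, and integer milliseconds give a 1 ms resolution, which is fine enough for the stated 1–5 ms precision.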