AUDIO + SPEECH ANNOTATION
Your model is only as accurate as the transcript it learned from
Human transcript error rates run under 2% where production ASR sits near 10%. Transcription, diarization, forced alignment, events, emotion, and intent for ASR, voice, and speech-LLM teams. 150+ languages, kappa-gated, EEA-resident.
- 150+ languages
- 100% human QA
- EEA-resident
ANNOTATION, NOT COLLECTION
You bring the audio. We return the labels.
Annotation of speech you already hold: call-center archives, meeting recordings, in-cabin and field audio, voice-assistant logs. If you still need to capture or record the speech itself, that is a collection engagement, and our collection teams run it separately.
You bring
- Call-center recordings
- Meetings and interviews
- Broadcast and media
- Field and in-cabin audio
- Voice-assistant logs
We return
- Verbatim and clean transcripts (TextGrid / JSON)
- Speaker turns, diarization (RTTM)
- Word and phone timestamps (CTM)
- Sound and acoustic events (CSV)
- Intent, slots, and entities (JSON)
- GDPR Articles 6, 7, 9
- EU AI Act Article 10
- EEA-resident, Norway
- 30-day erasure SLA
WHAT WE ANNOTATE
Every layer of the signal, on one time axis
Nine annotation layers over the same audio: what was said (transcription, forced alignment, language ID, named entities), who said it (diarization, emotion), and what else the model must ignore or act on (voice activity, sound events, intent and slots). One clip, every layer, time-aligned.
Beyond the clip: dataset services for speech-LLM and governance
- RLHF and preference data: Response rating, preference comparison, and human evaluation for speech and audio model outputs.
- Spoken-QA and instruction sets: Spoken question-answer pairs and instruction-response data for speech-LLM and audio-foundation training.
- PII redaction and de-identification: Personal and special-category spans marked for removal, with a re-identification review on the result.
- Anti-spoofing and synthetic-audio labels: Real-versus-synthetic, replay, and voice-conversion attack labels for voice-biometric and deepfake detection.
- Speech-to-speech and translation pairs: Aligned source-target speech and post-edited translation across 38+ language pairs.
HOW WE LABEL
Every project clears the same six gates
Most label-quality problems are decided before annotation starts, in the schema and the guideline. We lock both with your team, then measure inter-annotator agreement on a calibration round and refuse to start production until it clears your kappa threshold.
Schema lock
Label set, taxonomy, and edge-case policy fixed with your team before anyone annotates. Verbatim-versus-clean, disfluency handling, and overlap conventions are decided here, where they are cheap to change.
Versioned guideline
Task definitions, segmentation rules, and ambiguity handling written down and version-controlled. The guideline becomes part of your EU AI Act Article 10 provenance record, not a lost Slack thread.
Calibration round
A pilot batch on a shared subset, labeled blind by multiple annotators. We measure inter-annotator agreement and timestamp tolerance, then refine the guideline where annotators disagreed.
Agreement gate
Production does not start until the calibration round clears the kappa and timestamp tolerance agreed for your tasks. Targets are kappa at or above 0.8 for objective labels, 0.6 to 0.75 for subjective ones.
Production with adjudication
Annotation runs on self-hosted CVAT and Label Studio inside EEA infrastructure. Gold items are seeded throughout, and every annotator disagreement escalates to senior adjudication rather than a majority vote.
QA and delivery
100% human QA on the delivered batch. The per-task agreement report, GDPR Article 30 processing records, signed DPA, and sub-processor list ship with the labels, not on request.
Six gates. Agreement is proven before production starts, not audited after it ends.
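The agreement gate can be sketched in a few lines. This is a minimal illustration, assuming Cohen's kappa for two annotators; the text above does not say which kappa variant is used, and the function names and sample labels here are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def clears_gate(kappa, threshold=0.8):
    """True when agreement meets the threshold (0.8 is the objective-label target above)."""
    return kappa >= threshold

# Hypothetical calibration labels for a voice-activity task.
annotator_a = ["speech", "speech", "noise", "speech", "noise",
               "noise", "speech", "speech", "noise", "speech"]
annotator_b = ["speech", "speech", "noise", "speech", "noise",
               "speech", "speech", "speech", "noise", "speech"]

kappa = cohen_kappa(annotator_a, annotator_b)
gate_passed = clears_gate(kappa)
```

In this sample the annotators agree on nine of ten items, yet kappa comes out at roughly 0.78 once chance agreement is removed, so the batch would not clear a 0.8 gate. That is the reason to gate on kappa rather than on raw percent agreement.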
FORCED ALIGNMENT
Words locked to the waveform, within plus or minus fifty milliseconds
Loose word boundaries are invisible in a transcript and fatal in a TTS or lipsync model. We align word and phone timestamps to a plus or minus fifty millisecond tolerance, deliver them as TextGrid or CTM, and keep disfluencies marked rather than silently dropped, because the model has to learn them too.
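A tolerance check of this kind could look like the sketch below. It assumes the common CTM row layout (file, channel, start, duration, word, optional confidence) and compares a candidate alignment against a reference one; the clip names, timings, and helper names are hypothetical.

```python
from dataclasses import dataclass

TOLERANCE_S = 0.050  # plus or minus fifty milliseconds

@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def parse_ctm_line(line):
    """Parse one CTM row: <file> <channel> <start> <duration> <word> [confidence]."""
    fields = line.split()
    start, duration = float(fields[2]), float(fields[3])
    return Word(text=fields[4], start=start, end=start + duration)

def out_of_tolerance(candidate, reference, tolerance=TOLERANCE_S):
    """Return candidate words whose start or end is off by more than the tolerance."""
    if [w.text for w in candidate] != [w.text for w in reference]:
        raise ValueError("word sequences differ; align the text first")
    return [
        c for c, r in zip(candidate, reference)
        if abs(c.start - r.start) > tolerance or abs(c.end - r.end) > tolerance
    ]

# Hypothetical reference and candidate alignments for the same clip.
reference = [parse_ctm_line("clip01 1 0.32 0.21 hello"),
             parse_ctm_line("clip01 1 0.60 0.40 world")]
candidate = [parse_ctm_line("clip01 1 0.34 0.20 hello"),
             parse_ctm_line("clip01 1 0.68 0.35 world")]

flagged = out_of_tolerance(candidate, reference)
```

Here "hello" sits within tolerance on both boundaries, while "world" starts 80 ms late and is flagged for review.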
WHERE MODELS BREAK
The conditions your model meets that your data missed
Production audio is far-field, telephony-band, in-cabin, overlapping, code-switched, and accented. A model trained on clean close-talk speech meets all of it on day one. Coverage across these conditions, annotated consistently, closes more of the gap than another thousand hours of easy data.
HOW WE PROVE IT
We show the numbers, then co-define your targets
We do not sell an accuracy number. We report the metric types (word error rate against an expert reference, diarization error rate, and inter-annotator agreement), then co-define the targets with your team and prove them on a calibration round before production.
90% usable at delivery
We report the metric types and co-define the thresholds with your team. No accuracy is sold as a guarantee.
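For readers who want the first of those metrics made concrete, word error rate is the word-level edit distance between a transcript and the expert reference, divided by the reference length. The sketch below assumes both strings are already normalized (case, punctuation, numerals); the function name and sample sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical expert reference and transcript under test.
expert = "please transfer me to the billing department"
transcript = "please transfer me to billing apartment"

wer = word_error_rate(expert, transcript)
```

The sample has one deleted word and one substitution against a seven-word reference, so the rate is 2/7, about 29%. Diarization error rate and agreement need their own calculations and are not covered by this sketch.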
THE ARTICLE 9 QUESTION
Voice is special-category only when it identifies
Most speech annotation treats voice as ordinary personal data under a GDPR Article 6 basis with Article 7 consent. Only speaker identification, using voice features to recognize a specific person, crosses into Article 9 biometric special-category, which triggers Article 9(2)(a) explicit consent and a DPIA. We assign every task its lawful basis before annotation starts, so your DPO inherits the analysis instead of reconstructing it.
This describes YPAI processing posture, not legal advice. Final lawful basis is defined per engagement with your DPO.
WHAT SHIPS WITH EVERY BATCH
The audit trail ships with the labels, not as an upsell
Every batch returns your client-schema outputs alongside the governance records that prove how the labels were made: the agreement report, the EU AI Act Article 10 data record, the signed DPA, and the sub-processor list. No upgrade tier, no separate request, no follow-up email six months later when your auditor asks.
- transcripts/*.TextGrid SCHEMA
- diarization/*.rttm SCHEMA
- alignment/*.ctm SCHEMA
- events/*.csv REPORT
- nlu/intent-slot.json SCHEMA
- qa/iaa-report.pdf REPORT
- governance/art10-record.pdf PDF
- guidelines.md POLICY
- DPA.pdf CONTRACT
COVERAGE AND WORKFORCE
A vetted, named contributor network under one EEA jurisdiction, not an anonymous marketplace.
SCOPE A PILOT
Send us one hour of representative audio
Tell us the tasks, the languages, and the volume. We return a labeling plan, the relevant lawful-basis mapping, and a calibration approach. Deeper scoping happens in the reply.
The five that decide the engagement
No. Voice becomes Article 9 biometric special-category only when it is processed to uniquely identify a person, such as voiceprint or speaker-identification tasks. Transcription, diarization, alignment, and event labeling run under a GDPR Article 6 basis with Article 7 consent. We assign every task its lawful basis before work begins and document it for your DPO, so the Article 9 obligations attach only where they actually apply.
Protecting the data and outsourcing the labeling are not mutually exclusive. We can run annotation inside your own EEA infrastructure or a controlled environment, so the audio never leaves your boundary, while you get a vetted multilingual workforce, kappa-gated QA, and senior adjudication you would otherwise spend months building and calibrating yourself.
Inside the EEA, from intake to erasure. YPAI is a Norwegian company with an EEA contributor network and self-hosted European infrastructure, so your audio is not subject to US CLOUD Act compulsion. Erasure runs on a 30-day GDPR Article 17 SLA, and the member-state processing region is negotiable per engagement.
Your schema first: client-schema JSON for whatever your pipeline ingests. Alongside it come the standards your tooling already expects: RTTM for diarization, TextGrid for aligned transcripts and prosody, CTM for word and phone timestamps, and CSV for events. Compatible with self-hosted CVAT and Label Studio. The labels are your work product, owned by you.
Human annotation, with 100% human QA on every delivered batch. Models assist the annotators, but the deliverable is expert human judgment. Auto-labeling trains your model on another model's mistakes and quietly entrenches them, which is the failure this service exists to prevent.
START WITH ONE HOUR OF AUDIO
Bring us the audio. We return the labels, the metrics, and the records that prove how they were made. DPA included, NDA on request, EEA from intake to erasure.
Scope an annotation pilot