AUDIO + SPEECH ANNOTATION

Your model is only as accurate as the transcript it learned from

Human transcript error rates run under 2% where production ASR sits near 10%. Transcription, diarization, forced alignment, events, emotion, and intent for ASR, voice, and speech-LLM teams. 150+ languages, kappa-gated, EEA-resident.

150+ languages
100% human QA
EEA-resident

Scope an annotation pilot · See how we label

ANNOTATED CLIP 2 speakers · 4.0s · 48 kHz

Legend: speaker A · speaker B · event. Schematic of a labeled clip. Production runs against your audio.

ANNOTATION, NOT COLLECTION

You bring the audio. We return the labels.

Annotation of speech you already hold: call-center archives, meeting recordings, in-cabin and field audio, voice-assistant logs. If you still need to capture or record the speech itself, that is a collection engagement, and our collection teams run it separately.

You bring

Call-center recordings
Meetings and interviews
Broadcast and media
Field and in-cabin audio
Voice-assistant logs

We return

Verbatim and clean transcripts: TextGrid / JSON
Speaker turns (diarization): RTTM
Word and phone timestamps: CTM
Sound and acoustic events: CSV
Intent, slots, and entities: JSON
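
For teams wiring the returned labels into a pipeline, the diarization output can be read with a few lines of code. The following is a minimal sketch, not our delivery tooling: it assumes the common ten-field RTTM SPEAKER line layout, and the file ID `clip_001`, the sample turns, and the helper names `parse_rttm` and `talk_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    file_id: str
    speaker: str
    start: float  # seconds
    end: float    # seconds


def parse_rttm(text: str) -> list[Turn]:
    """Read SPEAKER lines: type, file, channel, onset, duration, ..., speaker name."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and non-SPEAKER record types.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(Turn(fields[1], fields[7], start, start + duration))
    return turns


def talk_time(turns: list[Turn]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = {}
    for turn in turns:
        totals[turn.speaker] = totals.get(turn.speaker, 0.0) + (turn.end - turn.start)
    return totals


# Hypothetical two-speaker clip, for illustration only.
sample = """\
SPEAKER clip_001 1 0.00 1.80 <NA> <NA> A <NA> <NA>
SPEAKER clip_001 1 1.95 2.05 <NA> <NA> B <NA> <NA>
"""

for speaker, seconds in talk_time(parse_rttm(sample)).items():
    print(f"speaker {speaker}: {seconds:.2f}s")
```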

Need audio captured? See data collection

GDPR Articles 6, 7, 9
EU AI Act Article 10
EEA-resident, Norway
30-day erasure SLA

WHAT WE ANNOTATE

Every layer of the signal, on one time axis

Nine annotation layers over the same audio: what was said (transcription, forced alignment, language ID, named entities), who said it and how (diarization, emotion), and what else the model must ignore or act on (voice activity, sound events, intent and slots). One clip, every layer, time-aligned.

Beyond the clip: dataset services for speech-LLM and governance

RLHF and preference data: Response rating, preference comparison, and human evaluation for speech and audio model outputs.
Spoken-QA and instruction sets: Spoken question-answer pairs and instruction-response data for speech-LLM and audio-foundation training.
PII redaction and de-identification: Personal and special-category spans marked for removal, with a re-identification review on the result.
Anti-spoofing and synthetic-audio labels: Real-versus-synthetic, replay, and voice-conversion attack labels for voice-biometric and deepfake detection.
Speech-to-speech and translation pairs: Aligned source-target speech and post-edited translation across 38+ language pairs.

HOW WE LABEL

Every project clears the same six gates

Most label-quality problems are decided before annotation starts, in the schema and the guideline. We lock both with your team, then measure inter-annotator agreement on a calibration round and refuse to start production until it clears your kappa threshold.

Schema lock

Label set, taxonomy, and edge-case policy fixed with your team before anyone annotates. Verbatim-versus-clean, disfluency handling, and overlap conventions are decided here, where they are cheap to change.

Versioned guideline

Task definitions, segmentation rules, and ambiguity handling written down and version-controlled. The guideline becomes part of your EU AI Act Article 10 provenance record, not a lost Slack thread.

Calibration round

A pilot batch on a shared subset, labeled blind by multiple annotators. We measure inter-annotator agreement and timestamp tolerance, then refine the guideline where annotators disagreed.
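
When more than two annotators label the same calibration subset, Fleiss' kappa is one standard way to summarize agreement. The sketch below is illustrative rather than a description of our internal tooling: the clip IDs and emotion labels are made up, and it assumes every item was labeled by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of labels."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total_labels = n_items * n_raters
    expected = sum((c / total_labels) ** 2 for c in category_totals.values())
    if expected == 1.0:  # only one category was ever used
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration batch: three annotators, four clips.
calibration = {
    "clip_001": ["neutral", "neutral", "neutral"],
    "clip_002": ["angry", "angry", "frustrated"],
    "clip_003": ["happy", "happy", "happy"],
    "clip_004": ["neutral", "frustrated", "frustrated"],
}

kappa = fleiss_kappa(list(calibration.values()))
# Items without unanimous labels point at guideline passages worth refining.
contested = [cid for cid, labels in calibration.items() if len(set(labels)) > 1]
print(f"Fleiss kappa: {kappa:.2f}; contested items: {contested}")
```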

Agreement gate

Production does not start until the calibration round clears the kappa and timestamp tolerance agreed for your tasks. Targets are kappa at or above 0.8 for objective labels, 0.6 to 0.75 for subjective ones.
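
The gate itself is simple arithmetic. As a minimal sketch for a two-annotator task, Cohen's kappa compares observed agreement with the agreement expected by chance, and the result is checked against the agreed threshold. The 0.80 objective target comes from the text above; the 0.60 subjective value is an assumption (the lower end of the stated range), and the voice-activity labels are invented for illustration.

```python
from collections import Counter

OBJECTIVE_KAPPA = 0.80   # target stated above for objective labels
SUBJECTIVE_KAPPA = 0.60  # assumption: lower end of the 0.6 to 0.75 range


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def clears_gate(kappa: float, threshold: float) -> bool:
    """Production starts only when agreement meets the agreed threshold."""
    return kappa >= threshold


# Hypothetical voice-activity labels from two annotators.
annotator_1 = ["speech", "speech", "noise", "noise"]
annotator_2 = ["speech", "noise", "noise", "noise"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}, clears objective gate: {clears_gate(kappa, OBJECTIVE_KAPPA)}")
```

In this toy batch one disagreement out of four items is enough to fail the objective gate, which is why calibration batches need to be large enough for the statistic to be stable.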

Production with adjudication

Annotation runs on self-hosted CVAT and Label Studio inside EEA infrastructure. Gold items are seeded throughout, and every annotator disagreement escalates to senior adjudication rather than a majority vote.
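
To make the difference from majority voting concrete, here is a hedged sketch of the routing rule described above: unanimous items are accepted, any disagreement goes to an adjudication queue, and seeded gold items score each annotator. The function names, item IDs, and labels are hypothetical and do not reflect the CVAT or Label Studio APIs.

```python
def route_items(item_labels: dict[str, list[str]]) -> tuple[dict[str, str], list[str]]:
    """Accept unanimous items; escalate every disagreement instead of voting."""
    accepted: dict[str, str] = {}
    escalated: list[str] = []
    for item_id, labels in item_labels.items():
        if len(set(labels)) == 1:
            accepted[item_id] = labels[0]
        else:
            escalated.append(item_id)  # goes to a senior adjudicator
    return accepted, escalated


def gold_accuracy(annotator_labels: dict[str, str], gold: dict[str, str]) -> float:
    """Share of seeded gold items an annotator labeled correctly."""
    scored = [item_id for item_id in gold if item_id in annotator_labels]
    if not scored:
        raise ValueError("annotator has not labeled any gold items")
    correct = sum(annotator_labels[item_id] == gold[item_id] for item_id in scored)
    return correct / len(scored)


# Hypothetical intent labels from three annotators.
batch = {
    "utt_01": ["book_flight", "book_flight", "book_flight"],
    "utt_02": ["cancel", "cancel", "refund"],  # 2-1 split still escalates
    "utt_03": ["greeting", "greeting", "greeting"],
}
accepted, escalated = route_items(batch)
print("accepted:", accepted)
print("escalated:", escalated)

# Hypothetical gold set seeded into one annotator's queue.
gold = {"utt_01": "book_flight", "utt_03": "greeting"}
print("gold accuracy:", gold_accuracy({"utt_01": "book_flight", "utt_03": "farewell"}, gold))
```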

QA and delivery

100% human QA on the delivered batch. The per-task agreement report, GDPR Article 30 processing records, signed DPA, and sub-processor list ship with the labels, not on request.

Six gates. Agreement is proven before production starts, not audited after it ends.

Scope an annotation pilot →

FORCED ALIGNMENT

Words locked to the waveform, within plus or minus fifty milliseconds

Loose word boundaries are invisible in a transcript and fatal in a TTS or lip-sync model. We align word and phone timestamps to a ±50 ms tolerance, deliver them as TextGrid or CTM, and keep disfluencies marked rather than silently dropped, because the model has to learn them too.

Static schematic. Production aligns your audio, sampled and QA-checked.

token | start – end (±50 ms)
the | 0.20s – 0.42s
model | 0.50s – 0.92s
uh (disfluency) | 0.96s – 1.10s
fails | 1.14s – 1.46s
here | 1.50s – 1.78s
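
A tolerance like this can be spot-checked on your side. The sketch below is an assumption-laden illustration, not our QA pipeline: it parses CTM lines (file, channel, start, duration, word), reuses the schematic timings above as the delivered alignment, and compares them with a hypothetical reference pass in which one boundary has drifted.

```python
TOLERANCE_S = 0.050  # plus or minus fifty milliseconds


def parse_ctm(text: str) -> list[tuple[str, float, float]]:
    """Return (word, start, end) from CTM lines: file channel start duration word."""
    words = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0].startswith(";;"):
            continue  # skip blanks and comment lines
        start = float(fields[2])
        duration = float(fields[3])
        words.append((fields[4], start, start + duration))
    return words


def out_of_tolerance(delivered, reference, tolerance=TOLERANCE_S):
    """List words whose start or end boundary deviates by more than the tolerance."""
    if [w for w, _, _ in delivered] != [w for w, _, _ in reference]:
        raise ValueError("token sequences differ; align transcripts first")
    flagged = []
    for (word, d_start, d_end), (_, r_start, r_end) in zip(delivered, reference):
        worst = max(abs(d_start - r_start), abs(d_end - r_end))
        if worst > tolerance:
            flagged.append((word, round(worst, 3)))
    return flagged


# Delivered alignment: timings from the schematic above.
delivered_ctm = """\
clip_001 1 0.20 0.22 the
clip_001 1 0.50 0.42 model
clip_001 1 0.96 0.14 uh
clip_001 1 1.14 0.32 fails
clip_001 1 1.50 0.28 here
"""

# Hypothetical reference pass; the start of "fails" is 80 ms later.
reference_ctm = """\
clip_001 1 0.21 0.22 the
clip_001 1 0.50 0.40 model
clip_001 1 0.97 0.13 uh
clip_001 1 1.22 0.24 fails
clip_001 1 1.50 0.28 here
"""

print(out_of_tolerance(parse_ctm(delivered_ctm), parse_ctm(reference_ctm)))
```

Note that the disfluency "uh" stays in both token sequences, so it is checked like any other word rather than dropped.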

WHERE MODELS BREAK

The conditions your model meets that your data missed

Production audio is far-field, telephony-band, in-cabin, overlapping, code-switched, and accented. A model trained on clean close-talk speech meets all of it on day one. Coverage across these conditions, annotated consistently, closes more of the gap than another thousand hours of easy data.

DEEP: production-grade depth, full QA

STANDARD: covered, routinely annotated

EDGE-CASE: sampled, smaller corpora / on request

GAP: not a standing capability

Coverage tiers are representative of standing annotation capability, not audited percentages.

Acoustic conditions: Clean, Reverberant, Babble noise, Music background, Code-switch, Accented
Capture channels: Close-talk mic, Far-field array, Telephony 8 kHz, In-cabin, VoIP / compressed

150+ languages, 50+ countries, native-speaker annotators across every condition above

HOW WE PROVE IT

We show the numbers, then co-define your targets

We do not sell an accuracy number. We report the metric types (word error rate against an expert reference, diarization error rate, and inter-annotator agreement), then co-define the targets with your team and prove them on a calibration round before production.
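
For readers less familiar with the first of these metrics, word error rate is the word-level edit distance between a transcript and the expert reference, divided by the reference length. The following is a minimal sketch under simple assumptions (whitespace tokenization, no text normalization); the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,                           # deletion
                    current[j - 1] + 1,                        # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        previous = current
    return previous[-1] / len(ref)


# Hypothetical verbatim reference versus a transcript that dropped the disfluency.
reference = "the model uh fails here"
hypothesis = "the model fails here"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Whether a dropped "uh" counts as an error depends on the verbatim-versus-clean convention fixed at schema lock, which is one reason the target is co-defined rather than quoted as a single number.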

90% usable at delivery

≥0.80 inter-annotator agreement, objective labels

We report the metric types and co-define the thresholds with your team. No accuracy figure is sold as a guarantee.

Chart panels: error rate (lower is better); inter-annotator agreement (Cohen / Fleiss kappa).

WHAT SHIPS WITH EVERY BATCH

The audit trail ships with the labels, not as an upsell

Every batch returns your client-schema outputs alongside the governance records that prove how the labels were made: the agreement report, the EU AI Act Article 10 data record, the signed DPA, and the sub-processor list. No upgrade tier, no separate request, no follow-up email six months later when your auditor asks.

File | Type | Document | Availability
transcripts/*.TextGrid | SCHEMA | Transcripts (verbatim + clean). Aligned tiers, disfluencies marked. | Public
diarization/*.rttm | SCHEMA | Speaker diarization. Who-spoke-when segments in RTTM. | Public
alignment/*.ctm | SCHEMA | Word and phone timestamps. Forced alignment in CTM. | Public
events/*.csv | REPORT | Sound and acoustic events. Onset and offset with class labels. | Public
nlu/intent-slot.json | SCHEMA | Intent, slots, and entities. NLU labels and named entities. | Public
qa/iaa-report.pdf | REPORT | Inter-annotator agreement report. Kappa, WER and DER, gold-set results. | Public
governance/art10-record.pdf | PDF | EU AI Act Article 10 data record. Provenance and bias-examination notes. | Pre-contract
guidelines.md | POLICY | Versioned annotation guideline. Task definitions and edge-case policy. | Public
DPA.pdf | CONTRACT | Signed DPA + sub-processor list. GDPR Article 28 processor terms. | Pre-contract

Request the DPA

COVERAGE AND WORKFORCE

A vetted, named contributor network under one EEA jurisdiction, not an anonymous marketplace.

150+ languages annotated
40,000+ vetted contributors
100% human QA coverage
90% usable rate at delivery
50+ countries
30-day GDPR Article 17 erasure SLA

SCOPE A PILOT

Send us one hour of representative audio

Tell us the tasks, the languages, and the volume. We return a labeling plan, the relevant lawful-basis mapping, and a calibration approach. Deeper scoping happens in the reply.

QUESTIONS BUYERS ACTUALLY ASK

The five that decide the engagement

Is voice data automatically special-category under GDPR Article 9?

No. Voice becomes Article 9 biometric special-category only when it is processed to uniquely identify a person, such as voiceprint or speaker-identification tasks. Transcription, diarization, alignment, and event labeling run under a GDPR Article 6 basis with Article 7 consent. We assign every task its lawful basis before work begins and document it for your DPO, so the Article 9 obligations attach only where they actually apply.

We are considering annotating in-house to protect the data. Why use YPAI?

Protecting the data and outsourcing the labeling are not mutually exclusive. We can run annotation inside your own EEA infrastructure or a controlled environment, so the audio never leaves your boundary, while you get a vetted multilingual workforce, kappa-gated QA, and senior adjudication you would otherwise spend months building and calibrating yourself.

Where is our audio processed?

Inside the EEA, from intake to erasure. YPAI is a Norwegian company with an EEA contributor network and self-hosted European infrastructure, so your audio is not subject to US CLOUD Act compulsion. Erasure runs on a 30-day GDPR Article 17 SLA, and the member-state processing region is negotiable per engagement.

What output formats do you deliver?

Your schema first: client-schema JSON for whatever your pipeline ingests, plus the standards your tooling already expects: RTTM for diarization, TextGrid for aligned transcripts and prosody, CTM for word and phone timestamps, and CSV for events. Compatible with self-hosted CVAT and Label Studio. The labels are your work product, owned by you.

Is this human annotation or auto-labeling?

Human annotation, with 100% human QA on every delivered batch. Models assist the annotators, but the deliverable is expert human judgment. Auto-labeling trains your model on another model's mistakes and quietly entrenches them, which is the failure this service exists to prevent.

START WITH ONE HOUR OF AUDIO

Bring us the audio. We return the labels, the metrics, and the records that prove how they were made. DPA included, NDA on request, EEA from intake to erasure.

Scope an annotation pilot

One-business-day reply · EEA-resident, Norway
