AUDIO + SPEECH ANNOTATION

Your model is only as accurate as the transcript it learned from

Human transcript word error rates run under 2%, where production ASR sits near 10%. Transcription, diarization, forced alignment, events, emotion, and intent for ASR, voice, and speech-LLM teams. 150+ languages, kappa-gated, EEA-resident.

  • 150+ languages
  • 100% human QA
  • EEA-resident

ANNOTATION, NOT COLLECTION

You bring the audio. We return the labels.

Annotation of speech you already hold: call-center archives, meeting recordings, in-cabin and field audio, voice-assistant logs. If you still need to capture or record the speech itself, that is a collection engagement, and our collection teams run it separately.

You bring

  • Call-center recordings
  • Meetings and interviews
  • Broadcast and media
  • Field and in-cabin audio
  • Voice-assistant logs

We return

  • Verbatim and clean transcripts: TextGrid / JSON
  • Speaker turns (diarization): RTTM
  • Word and phone timestamps: CTM
  • Sound and acoustic events: CSV
  • Intent, slots, and entities: JSON

Need audio captured? See data collection.

  • GDPR Articles 6, 7, 9
  • EU AI Act Article 10
  • EEA-resident, Norway
  • 30-day erasure SLA

WHAT WE ANNOTATE

Every layer of the signal, on one time axis

Nine annotation layers over the same audio: what was said (transcription, forced alignment, language ID, named entities), who said it (diarization, emotion), and what else the model must ignore or act on (voice activity, sound events, intent and slots). One clip, every layer, time-aligned.

Beyond the clip: dataset services for speech-LLM and governance

  • RLHF and preference data: response rating, preference comparison, and human evaluation for speech and audio model outputs.
  • Spoken-QA and instruction sets: spoken question-answer pairs and instruction-response data for speech-LLM and audio-foundation training.
  • PII redaction and de-identification: personal and special-category spans marked for removal, with a re-identification review on the result.
  • Anti-spoofing and synthetic-audio labels: real-versus-synthetic, replay, and voice-conversion attack labels for voice-biometric and deepfake detection.
  • Speech-to-speech and translation pairs: aligned source-target speech and post-edited translation across 38+ language pairs.

HOW WE LABEL

Every project clears the same six gates

Most label-quality problems are decided before annotation starts, in the schema and the guideline. We lock both with your team, then measure inter-annotator agreement on a calibration round and refuse to start production until it clears your kappa threshold.

01 Schema lock

Label set, taxonomy, and edge-case policy fixed with your team before anyone annotates. Verbatim-versus-clean, disfluency handling, and overlap conventions are decided here, where they are cheap to change.

02 Versioned guideline

Task definitions, segmentation rules, and ambiguity handling written down and version-controlled. The guideline becomes part of your EU AI Act Article 10 provenance record, not a lost Slack thread.

03 Calibration round

A pilot batch on a shared subset, labeled blind by multiple annotators. We measure inter-annotator agreement and timestamp tolerance, then refine the guideline where annotators disagreed.

04 Agreement gate

Production does not start until the calibration round clears the kappa and timestamp tolerance agreed for your tasks. Targets are kappa at or above 0.8 for objective labels, 0.6 to 0.75 for subjective ones.
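
A minimal sketch of how an agreement gate like this could be checked for two annotators, assuming Cohen's kappa on categorical labels. The function names, the example labels, and the way the 0.80 threshold is passed in are illustrative assumptions, not the production tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def clears_gate(kappa, threshold=0.80):
    """True when the calibration round meets the agreed kappa threshold."""
    return kappa >= threshold


# Hypothetical calibration labels for ten clips (objective task: speech vs noise).
annotator_a = ["speech", "speech", "noise", "speech", "noise",
               "noise", "speech", "speech", "noise", "speech"]
annotator_b = ["speech", "speech", "speech", "speech", "noise",
               "noise", "speech", "speech", "noise", "speech"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.3f} clears_gate={clears_gate(kappa)}")
```

In this made-up example the annotators agree on 9 of 10 clips, yet kappa comes out near 0.78 and the 0.80 gate is not cleared, which illustrates how chance-corrected agreement can be stricter than raw percent agreement.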

05 Production with adjudication

Annotation runs on self-hosted CVAT and Label Studio inside EEA infrastructure. Gold items are seeded throughout, and every annotator disagreement escalates to senior adjudication rather than a majority vote.

06 QA and delivery

100% human QA on the delivered batch. The per-task agreement report, GDPR Article 30 processing records, signed DPA, and sub-processor list ship with the labels, not on request.

Six gates. Agreement is proven before production starts, not audited after it ends.

FORCED ALIGNMENT

Words locked to the waveform, within plus or minus fifty milliseconds

Loose word boundaries are invisible in a transcript and fatal in a TTS or lipsync model. We align word and phone timestamps to a plus or minus fifty millisecond tolerance, deliver them as TextGrid or CTM, and keep disfluencies marked rather than silently dropped, because the model has to learn them too.

token   start – end (±50 ms)   note
the     0.20s – 0.42s
model   0.50s – 0.92s
uh      0.96s – 1.10s          disfluency
fails   1.14s – 1.46s
here    1.50s – 1.78s
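
One way a tolerance like this could be checked against an expert reference, as a sketch only. It reuses the token timings from the example above as the reference; the candidate timings, function names, and the choice to compare in integer milliseconds are assumptions made for illustration.

```python
TOLERANCE_MS = 50  # plus or minus fifty milliseconds


def to_ms(seconds):
    """Convert seconds to integer milliseconds to avoid float comparison noise."""
    return round(seconds * 1000)


def out_of_tolerance(reference, candidate, tol_ms=TOLERANCE_MS):
    """Return (token, start_error_ms, end_error_ms) for boundaries outside tolerance."""
    if len(reference) != len(candidate):
        raise ValueError("token sequences differ in length")
    flagged = []
    for (ref_tok, ref_start, ref_end), (cand_tok, cand_start, cand_end) in zip(reference, candidate):
        if ref_tok != cand_tok:
            raise ValueError(f"token mismatch: {ref_tok!r} vs {cand_tok!r}")
        start_err = abs(to_ms(ref_start) - to_ms(cand_start))
        end_err = abs(to_ms(ref_end) - to_ms(cand_end))
        if start_err > tol_ms or end_err > tol_ms:
            flagged.append((ref_tok, start_err, end_err))
    return flagged


# Reference boundaries in seconds, taken from the example above.
reference = [("the", 0.20, 0.42), ("model", 0.50, 0.92), ("uh", 0.96, 1.10),
             ("fails", 1.14, 1.46), ("here", 1.50, 1.78)]
# Hypothetical second-pass boundaries: "fails" starts 80 ms late.
candidate = [("the", 0.21, 0.42), ("model", 0.50, 0.95), ("uh", 0.96, 1.10),
             ("fails", 1.22, 1.46), ("here", 1.50, 1.78)]

print(out_of_tolerance(reference, candidate))
```

With these invented timings only "fails" is flagged, for an 80 ms start error; the 10 ms and 30 ms drifts on "the" and "model" stay inside the tolerance. The disfluency "uh" is checked like any other token, since it is kept in the alignment rather than dropped.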

WHERE MODELS BREAK

The conditions your model meets that your data missed

Production audio is far-field, telephony-band, in-cabin, overlapping, code-switched, and accented. A model trained on clean close-talk speech meets all of it on day one. Coverage across these conditions, annotated consistently, closes more of the gap than another thousand hours of easy data.

  • DEEP: production-grade depth, full QA
  • STANDARD: covered, routinely annotated
  • EDGE-CASE: sampled, smaller corpora / on request
  • GAP: not a standing capability
Coverage tiers are representative of standing annotation capability, not audited percentages.
150+ languages, 50+ countries, native-speaker annotators across every condition above

HOW WE PROVE IT

We show the numbers, then co-define your targets

We do not sell an accuracy number. We report the metric types, word error rate against an expert reference, diarization error rate, and inter-annotator agreement, then co-define the targets with your team and prove them on a calibration round before production.
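
For readers less familiar with the first metric, here is a minimal sketch of word error rate against a reference transcript, assuming whitespace tokenization and no text normalization. Real scoring usually fixes casing, punctuation, and disfluency conventions first; those choices, like the function name and the sample sentences, are assumptions here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the hypothesis drops "uh" and mishears "here".
expert_reference = "the model uh fails here"
machine_output = "the model fails there"
print(f"WER = {word_error_rate(expert_reference, machine_output):.2f}")
```

One deletion and one substitution over five reference words gives a WER of 0.40 in this toy case. It also shows why the verbatim-versus-clean decision matters for the metric: whether "uh" belongs in the reference changes the score.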

  • 90% usable at delivery
  • >=0.80 inter-annotator agreement, objective labels

We report the metric types and co-define the thresholds with your team. No accuracy is sold as a guarantee.

Error rate (lower is better)

  • Human transcript WER: 1.5%
  • Diarization DER: 7%
  • Machine WER (production): 10%

Inter-annotator agreement (Cohen / Fleiss kappa)

[Chart: agreement for subjective and objective labels on a 0 to 1 scale, with the gate marked at 0.80]

WHAT SHIPS WITH EVERY BATCH

The audit trail ships with the labels, not as an upsell

Every batch returns your client-schema outputs alongside the governance records that prove how the labels were made: the agreement report, the EU AI Act Article 10 data record, the signed DPA, and the sub-processor list. No upgrade tier, no separate request, no follow-up email six months later when your auditor asks.

File                         Type      Availability  Document
transcripts/*.TextGrid       SCHEMA    Public        Transcripts (verbatim + clean). Aligned tiers, disfluencies marked.
diarization/*.rttm           SCHEMA    Public        Speaker diarization. Who-spoke-when segments in RTTM.
alignment/*.ctm              SCHEMA    Public        Word and phone timestamps. Forced alignment in CTM.
events/*.csv                 REPORT    Public        Sound and acoustic events. Onset and offset with class labels.
nlu/intent-slot.json         SCHEMA    Public        Intent, slots, and entities. NLU labels and named entities.
qa/iaa-report.pdf            REPORT    Public        Inter-annotator agreement report. Kappa, WER and DER, gold-set results.
governance/art10-record.pdf  PDF       Pre-contract  EU AI Act Article 10 data record. Provenance and bias-examination notes.
guidelines.md                POLICY    Public        Versioned annotation guideline. Task definitions and edge-case policy.
DPA.pdf                      CONTRACT  Pre-contract  Signed DPA + sub-processor list. GDPR Article 28 processor terms.
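
For orientation, a sketch of what single RTTM and CTM records in the deliverables above typically look like when written from Python. The field order follows the widely used NIST-style layouts for both formats; the file ID, channel number, speaker name, and three-decimal precision are assumptions, and a client schema may differ.

```python
def rttm_line(file_id, onset_s, duration_s, speaker, channel=1):
    """One RTTM SPEAKER record: type, file, channel, onset, duration, then placeholder fields."""
    return (f"SPEAKER {file_id} {channel} {onset_s:.3f} {duration_s:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")


def ctm_line(file_id, token, start_s, end_s, channel=1):
    """One CTM record: file, channel, start time, duration, token."""
    return f"{file_id} {channel} {start_s:.3f} {end_s - start_s:.3f} {token}"


# Hypothetical recording ID and speaker label.
print(rttm_line("call_001", 12.5, 3.25, "spk1"))
print(ctm_line("call_001", "model", 0.50, 0.92))
```

Note that CTM stores a start time and a duration rather than a start and end pair, so the end timestamp is converted on write.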

COVERAGE AND WORKFORCE

A vetted, named contributor network under one EEA jurisdiction, not an anonymous marketplace.

  • 150+ languages annotated
  • 40,000+ vetted contributors
  • 100% human QA coverage
  • 90% usable rate at delivery
  • 50+ countries
  • 30-day GDPR Article 17 erasure SLA

SCOPE A PILOT

Send us one hour of representative audio

Tell us the tasks, the languages, and the volume. We return a labeling plan, the relevant lawful-basis mapping, and a calibration approach. Deeper scoping happens in the reply.

Include the capability lanes (NER, RLHF, etc.), languages, volume, and regulatory context.

QUESTIONS BUYERS ACTUALLY ASK

The five that decide the engagement

No. Voice becomes Article 9 biometric special-category only when it is processed to uniquely identify a person, such as voiceprint or speaker-identification tasks. Transcription, diarization, alignment, and event labeling run under a GDPR Article 6 basis with Article 7 consent. We assign every task its lawful basis before work begins and document it for your DPO, so the Article 9 obligations attach only where they actually apply.

Protecting the data and outsourcing the labeling are not mutually exclusive. We can run annotation inside your own EEA infrastructure or a controlled environment, so the audio never leaves your boundary, while you get a vetted multilingual workforce, kappa-gated QA, and senior adjudication you would otherwise spend months building and calibrating yourself.

Inside the EEA, from intake to erasure. YPAI is a Norwegian company with an EEA contributor network and self-hosted European infrastructure, so your audio is not subject to US CLOUD Act compulsion. Erasure runs on a 30-day GDPR Article 17 SLA, and the member-state processing region is negotiable per engagement.

Your schema first: client-schema JSON for whatever your pipeline ingests, plus the standards your tooling already expects, RTTM for diarization, TextGrid for aligned transcripts and prosody, CTM for word and phone timestamps, and CSV for events. Compatible with self-hosted CVAT and Label Studio. The labels are your work product, owned by you.

Human annotation, with 100% human QA on every delivered batch. Models assist the annotators, but the deliverable is expert human judgment. Auto-labeling trains your model on another model's mistakes and quietly entrenches them, which is the failure this service exists to prevent.

START WITH ONE HOUR OF AUDIO

Bring us the audio. We return the labels, the metrics, and the records that prove how they were made. DPA included, NDA on request, EEA from intake to erasure.

Scope an annotation pilot
  • One business day reply
  • EEA-resident, Norway