Key Takeaways
- Standard software SLAs do not apply to data quality. Transcription failures surface weeks after delivery, during model training, not at handoff.
- WER thresholds must be set against a buyer-supplied reference set. A vendor measuring their own WER against their own reference has no independent check.
- Require WER at the 90th percentile, not just the mean. A long tail above threshold is a systematic error problem that gradient descent amplifies.
- Inter-annotator agreement must be reported per batch, per annotation category. A project-level aggregate IAA score is not a quality signal.
- Batch rejection rights and re-annotation timelines must be written into the contract. Without them, a vendor can dispute your quality measurement and owe you nothing.
- GDPR-specific SLA clauses are separate from data quality SLAs. Require deletion timelines, breach notification windows, and sub-processor disclosure independently.
Most speech data contracts have no quality SLAs. The vendor commits to delivering a volume of annotated audio by a deadline. What arrives is not subject to any enforceable standard.
This creates a specific risk for production ASR projects: quality failures in training data are discovered during model training, weeks or months after delivery. At that point, the contract period is often over and the buyer has no recourse. This guide covers the SLA terms that actually protect you.
Why standard SLAs do not translate to data quality
Software SLAs measure things that fail instantly and visibly: uptime, response time, error rates. Data quality failures are invisible at delivery. A batch of transcripts that misses your WER threshold looks identical to one that passes until you evaluate it against a reference set.
The timing gap is the core problem. By the time a model trained on bad data produces degraded results, the vendor relationship may be months old. Without quality SLAs written into the contract, you have accepted the data as delivered and have no basis for a dispute.
The solution is to define quality thresholds before delivery, not after.
The quality metrics that belong in a speech data SLA
Transcription accuracy: WER against a buyer-supplied reference set
Word Error Rate must be measured against a reference set supplied by the buyer, not the vendor. A vendor evaluating their own output against their own reference has no independent check. The reference set is a held-out sample of your target audio, transcribed by a domain expert or trusted independent annotator.
Set the WER threshold at the 90th percentile, not the mean. A corpus with a mean WER of 8% can have a long tail of utterances at 35%+ WER. The long tail is not averaged out during model training. It is learned. The transcription quality benchmarks post covers exactly how systematic errors in the long tail amplify during fine-tuning.
The contract should specify:
- Maximum acceptable mean WER at batch delivery
- Maximum acceptable 90th percentile WER
- The reference set procedure (who supplies it, when, and by what method)
- The evaluation tool (e.g., jiwer, a specific scoring script, or a named methodology); a minimal scoring sketch follows below
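A minimal sketch of how that batch check might be scripted, assuming the buyer-supplied reference set arrives as per-utterance (reference, hypothesis) transcript pairs and using jiwer for scoring. The threshold values shown are placeholders to calibrate during the pilot, not recommendations:

```python
# Batch-level WER check against a buyer-supplied reference set.
# Assumes `pairs` is a list of (reference_transcript, vendor_transcript) strings.
import jiwer
import numpy as np

def evaluate_batch(pairs, max_mean_wer=0.10, max_p90_wer=0.18):
    """Score per-utterance WER, then check the mean and 90th-percentile thresholds."""
    per_utterance = [jiwer.wer(ref, hyp) for ref, hyp in pairs]
    mean_wer = float(np.mean(per_utterance))
    p90_wer = float(np.percentile(per_utterance, 90))
    return {
        "mean_wer": round(mean_wer, 3),
        "p90_wer": round(p90_wer, 3),
        "passed": mean_wer <= max_mean_wer and p90_wer <= max_p90_wer,
    }

batch = [
    ("turn left at the next junction", "turn left at the next junction"),
    ("book a table for four at seven", "book a table for for at seven"),
]
print(evaluate_batch(batch))
```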
Inter-annotator agreement: per batch, per annotation category
Inter-annotator agreement measures how consistently different annotators produce the same label for the same audio. It is the quality signal that WER cannot provide: a corpus can have acceptable aggregate WER on the reference sample while still having low IAA, which means annotators are applying the guidelines inconsistently on all the audio the reference set never touches. IAA alone is not sufficient either: it stays high when annotators collectively adopt the same systematic error, which is why the buyer-supplied reference set is still required.
Require IAA to be reported per batch, not per project. A project-level aggregate IAA score is computed after annotation is complete and has no mechanism for catching quality drift during the work. Per-batch reporting means quality is monitored as annotation proceeds.
Specify the measurement methodology. Cohen’s kappa and Krippendorff’s alpha are the two most common. The contract should name one and define the minimum acceptable score per annotation category: transcription, speaker attribution, disfluency handling, and punctuation where relevant.
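As an illustration, a per-batch IAA report using Cohen's kappa might look like the sketch below, assuming each batch includes a double-annotated subset and that labels are grouped by annotation category. The category names and the 0.8 floor are assumptions for the example, not recommended contract values:

```python
# Per-batch, per-category inter-annotator agreement using Cohen's kappa.
# `double_annotated` maps category -> (annotator_a_labels, annotator_b_labels).
from sklearn.metrics import cohen_kappa_score

def batch_iaa_report(double_annotated, min_kappa=0.8):
    report = {}
    for category, (labels_a, labels_b) in double_annotated.items():
        kappa = float(cohen_kappa_score(labels_a, labels_b))
        report[category] = {"kappa": round(kappa, 3), "passed": kappa >= min_kappa}
    return report

batch_07 = {
    "speaker_attribution": (["spk1", "spk2", "spk1"], ["spk1", "spk2", "spk2"]),
    "disfluency_handling": (["keep", "drop", "drop"], ["keep", "drop", "drop"]),
}
print(batch_iaa_report(batch_07))  # flags any category that falls below the floor
```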
Speaker attribution accuracy
For multi-speaker audio, speaker attribution is a separate quality dimension from transcription accuracy. WER can be zero while speaker mislabeling is systematic. A model trained on audio where Speaker A’s words are consistently attributed to Speaker B learns the wrong association between acoustic characteristics and speaker identity.
Require speaker attribution accuracy to be reported as its own metric, evaluated independently on a multi-speaker subset of each batch. The contract should specify the minimum acceptable attribution accuracy and the procedure for re-annotation when a batch fails it.
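A hedged sketch of that check, assuming vendor and reference annotations use an agreed set of speaker IDs and that each segment record carries both labels. The 0.95 floor is an illustrative placeholder:

```python
# Speaker attribution accuracy on the multi-speaker subset of a batch.
def speaker_attribution_accuracy(segments, min_accuracy=0.95):
    multi = [s for s in segments if s["num_speakers"] > 1]
    if not multi:
        return None  # no multi-speaker audio in this batch to evaluate
    correct = sum(1 for s in multi if s["vendor_speaker"] == s["reference_speaker"])
    accuracy = correct / len(multi)
    return {"accuracy": round(accuracy, 3), "passed": accuracy >= min_accuracy}
```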
Delivery completeness
Metadata completeness is often omitted from quality SLAs and surfaces as a post-delivery problem. Specify:
- Every field defined in your delivery specification must be populated in every file
- Per-segment timestamps must align to audio within an acceptable tolerance (e.g., 50ms)
- Annotator ID and QA gate outcome must be present in every segment record
Incomplete metadata is not a minor formatting issue. Missing annotator IDs break chain-of-custody. Missing timestamps break downstream pipeline tooling.
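A minimal completeness gate might look like the following, assuming segment records are delivered as JSON-like dicts; the field names and the 50ms tolerance mirror the examples above but would need to match your actual delivery schema:

```python
# Delivery completeness check for a single segment record.
REQUIRED_FIELDS = ["transcript", "start_ms", "end_ms", "annotator_id", "qa_outcome"]

def completeness_errors(segment, audio_duration_ms, tolerance_ms=50):
    errors = []
    for field in REQUIRED_FIELDS:
        if segment.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    if segment.get("start_ms") is not None and segment.get("end_ms") is not None:
        if segment["start_ms"] < -tolerance_ms:
            errors.append("segment starts before the audio")
        if segment["end_ms"] > audio_duration_ms + tolerance_ms:
            errors.append("segment ends beyond the audio duration plus tolerance")
    return errors  # an empty list means the segment passes the completeness gate
```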
What happens when SLA thresholds are missed
Threshold definitions without enforcement mechanisms are not SLAs. The contract must define:
Batch rejection rights. The buyer can reject any batch that misses a defined threshold. Rejection triggers a re-annotation obligation on the vendor. Without this clause, a vendor can dispute your measurement and owe you no re-delivery while the project stalls.
Re-annotation timelines. The contract must specify how many days the vendor has to re-annotate and redeliver a rejected batch. A vendor who accepts re-annotation obligations without timelines can fulfill them arbitrarily slowly.
Financial remedies. Define what credit or discount applies for SLA failures. This does not need to be punitive, but it needs to exist. Without financial consequences, the SLA is advisory.
Dispute escalation. If the vendor disputes your quality measurement, the contract needs a defined process: a third-party review, a named methodology for resolving disagreement, and a timeline. Without this, disputes block re-annotation indefinitely.
Pilot evaluation as SLA baseline
SLA thresholds calibrated without data will be wrong. Either they will be too strict for the vendor to meet on your specific audio conditions, creating constant disputes, or too lenient to protect you.
Run a paid pilot evaluation before contract finalization. The pilot’s purpose is to establish the WER baseline and IAA baseline for this vendor on your specific audio. Use those numbers to calibrate the thresholds in the SLA.
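One simple way to turn pilot results into contract numbers, assuming the pilot produces per-utterance WER scores; the 15% headroom margin is an assumed negotiation starting point, not a recommendation:

```python
# Calibrate SLA thresholds from pilot-evaluation WER scores.
import numpy as np

def calibrate_wer_thresholds(pilot_wer_scores, headroom=0.15):
    mean_wer = float(np.mean(pilot_wer_scores))
    p90_wer = float(np.percentile(pilot_wer_scores, 90))
    return {
        "max_mean_wer": round(mean_wer * (1 + headroom), 3),
        "max_p90_wer": round(p90_wer * (1 + headroom), 3),
    }
```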
The pilot also surfaces vendor behavior under evaluation conditions. A vendor whose quality drops significantly between the sales demo and the pilot evaluation tells you something important before you sign a volume contract.
See speech data vendor evaluation criteria for a full framework for structuring the pilot.
GDPR-specific SLA terms
Data quality SLAs and GDPR SLAs are separate. Both are necessary. Require these independently:
Data deletion timelines. Under GDPR Article 17, speakers can request erasure of their recordings. The vendor must be able to fulfill a deletion request within a specified number of days from receiving it. Define this timeline in the contract.
Breach notification window. If the vendor experiences a data breach involving your audio corpus, you need to know quickly enough to fulfill your own breach notification obligations under GDPR Article 33. Require notification within 48 or 72 hours of the vendor becoming aware of a breach.
Sub-processor disclosure. If the vendor uses sub-processors to handle any part of the annotation workflow (crowd platforms, QA services, storage providers), they must disclose them. GDPR Article 28 requires this. Undisclosed sub-processors create compliance gaps you cannot audit.
Red flags in vendor responses to SLA requirements
“We don’t do SLAs on data quality.” This is a complete disqualifier for production ASR projects. It means the vendor is unwilling to be accountable for what they deliver.
WER claims without methodology. Any vendor quoting a WER figure without specifying the audio conditions, reference set, and evaluation method is marketing a number, not measuring quality.
No independent reference set. A vendor measuring their own WER against their own reference has no quality check. If they resist using a buyer-supplied reference set, they are not confident their output holds up to independent evaluation.
Project-level IAA only. Per-batch IAA is not a high bar. A vendor who can only produce project-level agreement scores does not monitor quality during annotation.
No escalation process for disputed results. A vendor who agrees to thresholds but has no process for resolving disputes will use that gap to contest every rejected batch.
How YPAI approaches SLA requirements
YPAI collects European speech corpora with documented WER baselines established per pilot evaluation, per-batch IAA reports using documented methodologies, and batch rejection and re-annotation procedures available on request.
Data collection is EEA-only, conducted under the supervision of Datatilsynet, the Norwegian data protection authority, which has overseen our collection practices. Every speaker provides consent that explicitly covers AI training use cases. Right-to-erasure procedures are documented and tested, not merely described in a privacy notice.
With 20,000 verified contributors and coverage across 50+ EU dialects, YPAI’s methodology is designed for buyers who need the data to hold up under independent evaluation, not just under vendor-supplied metrics.
If you are drafting SLA terms for an upcoming vendor negotiation, talk to our team to discuss how we approach quality measurement and contractual accountability.
Related articles
- Transcription Quality Benchmarks for LLM STT Training
- Audio Annotation Pipeline for Speech Data Labeling
- Speech Data Vendor Evaluation for Enterprise ASR