Key Takeaways
- Standard software SLAs do not apply to data quality. Transcription failures surface weeks after delivery, during model training, not at handoff.
- WER thresholds must be set against a buyer-supplied reference set. A vendor measuring their own WER against their own reference has no independent check.
- Require WER at the 90th percentile, not just the mean. A long tail above threshold is a systematic error problem that gradient descent amplifies.
- Inter-annotator agreement must be reported per batch, per annotation category. A project-level aggregate IAA score is not a quality signal.
- Batch rejection rights and re-annotation timelines must be written into the contract. Without them, a vendor can dispute your quality measurement and owe you nothing.
- GDPR-specific SLA clauses are separate from data quality SLAs. Require deletion timelines, breach notification windows, and sub-processor disclosure independently.
Most speech data contracts have no quality SLAs. The vendor commits to delivering a volume of annotated audio by a deadline. What arrives is not subject to any enforceable standard.
This creates a specific risk for production ASR projects: quality failures in training data are discovered during model training, weeks or months after delivery. At that point, the contract period is often over and the buyer has no recourse. This guide covers the SLA terms that actually protect you.
Why standard SLAs do not translate to data quality
Software SLAs measure things that fail instantly and visibly: uptime, response time, error rates. Data quality failures are invisible at delivery. A batch of transcripts that misses your WER threshold looks identical to one that passes until you evaluate it against a reference set.
The timing gap is the core problem. By the time a model trained on bad data produces degraded results, the vendor relationship may be months old. Without quality SLAs written into the contract, you have accepted the data as delivered and have no basis for a dispute.
The solution is to define quality thresholds before delivery, not after.
The quality metrics that belong in a speech data SLA
Transcription accuracy: WER against a buyer-supplied reference set
Word Error Rate must be measured against a reference set supplied by the buyer, not the vendor. A vendor evaluating their own output against their own reference has no independent check. The reference set is a held-out sample of your target audio, transcribed by a domain expert or trusted independent annotator.
Set the WER threshold at the 90th percentile, not the mean. A corpus with a mean WER of 8% can have a long tail of utterances at 35%+ WER. The long tail is not averaged out during model training. It is learned. The transcription quality benchmarks post covers exactly how systematic errors in the long tail amplify during fine-tuning.
The contract should specify:
- Maximum acceptable mean WER at batch delivery
- Maximum acceptable 90th percentile WER
- The reference set procedure (who supplies it, when, and by what method)
- The evaluation tool (e.g., jiwer, a specific scoring script, or a named methodology); a minimal scoring sketch follows below
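A minimal sketch of how that batch check might be scripted, assuming the buyer-supplied reference set arrives as per-utterance (reference, hypothesis) transcript pairs and using jiwer for scoring. The threshold values shown are placeholders to calibrate during the pilot, not recommendations:

```python
# Batch-level WER check against a buyer-supplied reference set.
# Assumes `pairs` is a list of (reference_transcript, vendor_transcript) strings.
import jiwer
import numpy as np

def evaluate_batch(pairs, max_mean_wer=0.10, max_p90_wer=0.18):
    """Score per-utterance WER, then check the mean and 90th-percentile thresholds."""
    per_utterance = [jiwer.wer(ref, hyp) for ref, hyp in pairs]
    mean_wer = float(np.mean(per_utterance))
    p90_wer = float(np.percentile(per_utterance, 90))
    return {
        "mean_wer": round(mean_wer, 3),
        "p90_wer": round(p90_wer, 3),
        "passed": mean_wer <= max_mean_wer and p90_wer <= max_p90_wer,
    }

batch = [
    ("turn left at the next junction", "turn left at the next junction"),
    ("book a table for four at seven", "book a table for for at seven"),
]
print(evaluate_batch(batch))
```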
Inter-annotator agreement: per batch, per annotation category
Inter-annotator agreement measures how consistently different annotators produce the same label for the same audio. It is the quality signal that WER cannot provide: a corpus can have acceptable aggregate WER on the reference sample while still having low IAA, which means annotators are applying the guidelines inconsistently on all the audio the reference set never touches. IAA alone is not sufficient either: it stays high when annotators collectively adopt the same systematic error, which is why the buyer-supplied reference set is still required.
Require IAA to be reported per batch, not per project. A project-level aggregate IAA score is computed after annotation is complete and has no mechanism for catching quality drift during the work. Per-batch reporting means quality is monitored as annotation proceeds.
Specify the measurement methodology. Cohen’s kappa and Krippendorff’s alpha are the two most common. The contract should name one and define the minimum acceptable score per annotation category: transcription, speaker attribution, disfluency handling, and punctuation where relevant.
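As an illustration, a per-batch IAA report using Cohen's kappa might look like the sketch below, assuming each batch includes a double-annotated subset and that labels are grouped by annotation category. The category names and the 0.8 floor are assumptions for the example, not recommended contract values:

```python
# Per-batch, per-category inter-annotator agreement using Cohen's kappa.
# `double_annotated` maps category -> (annotator_a_labels, annotator_b_labels).
from sklearn.metrics import cohen_kappa_score

def batch_iaa_report(double_annotated, min_kappa=0.8):
    report = {}
    for category, (labels_a, labels_b) in double_annotated.items():
        kappa = float(cohen_kappa_score(labels_a, labels_b))
        report[category] = {"kappa": round(kappa, 3), "passed": kappa >= min_kappa}
    return report

batch_07 = {
    "speaker_attribution": (["spk1", "spk2", "spk1"], ["spk1", "spk2", "spk2"]),
    "disfluency_handling": (["keep", "drop", "drop"], ["keep", "drop", "drop"]),
}
print(batch_iaa_report(batch_07))  # flags any category that falls below the floor
```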
Speaker attribution accuracy
For multi-speaker audio, speaker attribution is a separate quality dimension from transcription accuracy. WER can be zero while speaker mislabeling is systematic. A model trained on audio where Speaker A’s words are consistently attributed to Speaker B learns the wrong association between acoustic characteristics and speaker identity.
Require speaker attribution accuracy to be reported as its own metric, evaluated independently on a multi-speaker subset of each batch. The contract should specify the minimum acceptable attribution accuracy and the procedure for re-annotation when a batch fails it.
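A hedged sketch of that check, assuming vendor and reference annotations use an agreed set of speaker IDs and that each segment record carries both labels. The 0.95 floor is an illustrative placeholder:

```python
# Speaker attribution accuracy on the multi-speaker subset of a batch.
def speaker_attribution_accuracy(segments, min_accuracy=0.95):
    multi = [s for s in segments if s["num_speakers"] > 1]
    if not multi:
        return None  # no multi-speaker audio in this batch to evaluate
    correct = sum(1 for s in multi if s["vendor_speaker"] == s["reference_speaker"])
    accuracy = correct / len(multi)
    return {"accuracy": round(accuracy, 3), "passed": accuracy >= min_accuracy}
```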
Delivery completeness
Metadata completeness is often omitted from quality SLAs and surfaces as a post-delivery problem. Specify:
- Every field defined in your delivery specification must be populated in every file
- Per-segment timestamps must align to audio within an acceptable tolerance (e.g., 50ms)
- Annotator ID and QA gate outcome must be present in every segment record
Incomplete metadata is not a minor formatting issue. Missing annotator IDs break chain-of-custody. Missing timestamps break downstream pipeline tooling.
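A minimal completeness gate might look like the following, assuming segment records are delivered as JSON-like dicts; the field names and the 50ms tolerance mirror the examples above but would need to match your actual delivery schema:

```python
# Delivery completeness check for a single segment record.
REQUIRED_FIELDS = ["transcript", "start_ms", "end_ms", "annotator_id", "qa_outcome"]

def completeness_errors(segment, audio_duration_ms, tolerance_ms=50):
    errors = []
    for field in REQUIRED_FIELDS:
        if segment.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    if segment.get("start_ms") is not None and segment.get("end_ms") is not None:
        if segment["start_ms"] < -tolerance_ms:
            errors.append("segment starts before the audio")
        if segment["end_ms"] > audio_duration_ms + tolerance_ms:
            errors.append("segment ends beyond the audio duration plus tolerance")
    return errors  # an empty list means the segment passes the completeness gate
```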
What happens when SLA thresholds are missed
Threshold definitions without enforcement mechanisms are not SLAs. The contract must define:
Batch rejection rights. The buyer can reject any batch that misses a defined threshold. Rejection triggers a re-annotation obligation on the vendor. Without this clause, a vendor can dispute your measurement and owe you no re-delivery while the project stalls.
Re-annotation timelines. The contract must specify how many days the vendor has to re-annotate and redeliver a rejected batch. A vendor who accepts re-annotation obligations without timelines can fulfill them arbitrarily slowly.
Financial remedies. Define what credit or discount applies for SLA failures. This does not need to be punitive, but it needs to exist. Without financial consequences, the SLA is advisory.
Dispute escalation. If the vendor disputes your quality measurement, the contract needs a defined process: a third-party review, a named methodology for resolving disagreement, and a timeline. Without this, disputes block re-annotation indefinitely.
Pilot evaluation as SLA baseline
SLA thresholds calibrated without data will be wrong. Either they will be too strict for the vendor to meet on your specific audio conditions, creating constant disputes, or too lenient to protect you.
Run a paid pilot evaluation before contract finalization. The pilot’s purpose is to establish the WER baseline and IAA baseline for this vendor on your specific audio. Use those numbers to calibrate the thresholds in the SLA.
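One simple way to turn pilot results into contract numbers, assuming the pilot produces per-utterance WER scores; the 15% headroom margin is an assumed negotiation starting point, not a recommendation:

```python
# Calibrate SLA thresholds from pilot-evaluation WER scores.
import numpy as np

def calibrate_wer_thresholds(pilot_wer_scores, headroom=0.15):
    mean_wer = float(np.mean(pilot_wer_scores))
    p90_wer = float(np.percentile(pilot_wer_scores, 90))
    return {
        "max_mean_wer": round(mean_wer * (1 + headroom), 3),
        "max_p90_wer": round(p90_wer * (1 + headroom), 3),
    }
```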
The pilot also surfaces vendor behavior under evaluation conditions. A vendor whose quality drops significantly between the sales demo and the pilot evaluation tells you something important before you sign a volume contract.
See speech data vendor evaluation criteria for a full framework for structuring the pilot.
GDPR-specific SLA terms
Data quality SLAs and GDPR SLAs are separate. Both are necessary. Require these independently:
Data deletion timelines. Under GDPR Article 17, speakers can request erasure of their recordings. The vendor must be able to fulfill a deletion request within a specified number of days from receiving it. Define this timeline in the contract.
Breach notification window. If the vendor experiences a data breach involving your audio corpus, you need to know quickly enough to fulfill your own breach notification obligations under GDPR Article 33. Require notification within 48 or 72 hours of the vendor becoming aware of a breach.
Sub-processor disclosure. If the vendor uses sub-processors to handle any part of the annotation workflow (crowd platforms, QA services, storage providers), they must disclose them. GDPR Article 28 requires this. Undisclosed sub-processors create compliance gaps you cannot audit.
Red flags in vendor responses to SLA requirements
“We don’t do SLAs on data quality.” This is a complete disqualifier for production ASR projects. It means the vendor is unwilling to be accountable for what they deliver.
WER claims without methodology. Any vendor quoting a WER figure without specifying the audio conditions, reference set, and evaluation method is marketing a number, not measuring quality.
No independent reference set. A vendor measuring their own WER against their own reference has no quality check. If they resist using a buyer-supplied reference set, they are not confident their output holds up to independent evaluation.
Project-level IAA only. Per-batch IAA is not a high bar. A vendor who can only produce project-level agreement scores does not monitor quality during annotation.
No escalation process for disputed results. A vendor who agrees to thresholds but has no process for resolving disputes will use that gap to contest every rejected batch.
How YPAI approaches SLA requirements
YPAI collects European speech corpora with documented WER baselines established per pilot evaluation, per-batch IAA reports using documented methodologies, and batch rejection and re-annotation procedures available on request.
Data collection is EEA-only, conducted under the supervision of Datatilsynet, the Norwegian data protection authority, which has overseen our collection practices. Every speaker provides consent that explicitly covers AI training use cases. Right-to-erasure procedures are documented and tested, not merely described in a privacy notice.
With 20,000 verified contributors and coverage across 50+ EU dialects, YPAI’s methodology is designed for buyers who need the data to hold up under independent evaluation, not just under vendor-supplied metrics.
If you are drafting SLA terms for an upcoming vendor negotiation, talk to our team to discuss how we approach quality measurement and contractual accountability.
Related articles
- Transcription Quality Benchmarks for LLM STT Training
- Audio Annotation Pipeline for Speech Data Labeling
- Speech Data Vendor Evaluation for Enterprise ASR