Key Takeaways
- Vendor marketing claims about accuracy percentages are meaningless without a documented methodology. Always ask for the evaluation protocol, not just the number.
- Native-speaker annotators matched to the target dialect are non-negotiable. A vendor using wrong-language workers is not annotating your audio.
- Inter-annotator agreement must be tracked and reported per batch. Vendors who cannot produce IAA scores have no quality signal.
- GDPR compliance is not a checkbox. Demand consent documentation, EEA data residency evidence, and right-to-erasure procedures before signing.
- EU AI Act Article 10 compliance requires chain-of-custody from speaker recruitment to delivery. Most vendors cannot provide this.
- A pilot on your hardest audio conditions is the only reliable evaluation. Any vendor worth working with at scale will agree to it.
Most speech data vendors make their pitch the same way. They cite an accuracy figure, name recognizable enterprise clients, and offer a competitive price per hour of audio. The problem is that none of those signals tell you whether the data will train a production-grade ASR system.
This guide covers the six evaluation criteria that actually matter, the red flags that distinguish bulk suppliers from production-grade vendors, and how to structure a pilot before committing to a volume contract.
Why vendor evaluation matters more than dataset specs
The pitch problem is real. Vendors routinely oversell by presenting favorable aggregate metrics without disclosing how those metrics were calculated, which audio conditions they apply to, or what failure modes the numbers conceal.
A vendor quoting accuracy on clean studio audio cannot be compared to one quoting accuracy on noisy in-cabin recordings. A vendor reporting IAA on a single annotator type cannot be compared to one reporting cross-annotator agreement across multiple dialect groups. The specs look comparable on paper. The data quality is not.
For enterprise ASR, where model performance directly affects product reliability, the cost of a poor vendor decision is not just the purchase price. It is the training run, the re-annotation work, and the delay in shipping.
The six evaluation criteria that matter
1. Native-speaker annotators per target language and dialect
The annotators who transcribe and label your audio must be native speakers of the specific language variant you are targeting. This is not a preference. It is a requirement for producing accurate labels.
A vendor routing Norwegian audio through annotators who speak Swedish but not Norwegian cannot reliably catch phonemic distinctions, prosodic patterns, or dialect-specific vocabulary. The errors they introduce are systematic, not random, and they compound across the dataset.
The question to ask: “How do you match annotators to specific dialects within a target language?” A credible vendor will describe a structured annotator matching protocol. A vendor who responds with a general statement about “native speakers” without dialect-level granularity does not have one.
2. Documented QA gates with inter-annotator agreement tracking
Inter-annotator agreement measures how consistently different annotators produce the same label for the same audio. Low IAA indicates that annotation guidelines are ambiguous or that annotators are not applying them consistently. High IAA confirms that your labels are reproducible and auditable.
IAA must be tracked per batch and per annotator type, not as a single aggregate across the entire project. A vendor who reports IAA only at project completion has no mechanism for catching quality drift during annotation.
Ask for the IAA methodology (Cohen’s kappa, Krippendorff’s alpha, or a domain-specific agreement metric), the thresholds that trigger annotator retraining, and a sample IAA report from a previous project. Inability to produce any of these is a disqualifying signal.
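To make that ask concrete, here is a minimal sketch of what per-batch agreement tracking can look like. It assumes two annotators labeling the same segments per batch and uses scikit-learn's `cohen_kappa_score`; the 0.80 threshold, batch IDs, and labels are illustrative, not any vendor's actual pipeline.

```python
# Minimal per-batch IAA check (illustrative sketch, not a production pipeline).
# Assumes two annotators assign categorical labels to the same segments in each batch.
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.80  # illustrative; set per project requirements

def check_batch_agreement(batches: dict[str, tuple[list[str], list[str]]]) -> list[str]:
    """Return batch IDs whose pairwise Cohen's kappa falls below the threshold."""
    flagged = []
    for batch_id, (annotator_a, annotator_b) in batches.items():
        kappa = cohen_kappa_score(annotator_a, annotator_b)
        print(f"{batch_id}: kappa = {kappa:.3f}")
        if kappa < KAPPA_THRESHOLD:
            flagged.append(batch_id)  # trigger annotator review / retraining
    return flagged

# Hypothetical labels for the same three segments from two annotators:
batches = {
    "batch-001": (["speech", "noise", "speech"], ["speech", "noise", "speech"]),
    "batch-002": (["speech", "speech", "noise"], ["noise", "speech", "speech"]),
}
print("Needs review:", check_batch_agreement(batches))  # flags batch-002
```

For batches processed by more than two annotators, Krippendorff's alpha generalizes the same idea to arbitrary annotator counts and missing labels.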
3. GDPR compliance: consent framework, EEA residency, right-to-erasure documentation
GDPR-compliant speech data collection requires that speakers provide informed consent specifically covering AI training as a use case. General consent for research or transcription is not sufficient.
Three specific questions to ask any vendor:
- Consent scope: Do your speaker consent agreements explicitly name AI training as the intended use case?
- EEA residency: Is speaker data collected, stored, and processed within the European Economic Area?
- Right to erasure: What is your documented process if a speaker requests deletion under GDPR Article 17? Can you trace and remove a specific speaker’s recordings from your corpus after delivery?
A vendor who cannot provide specific procedural answers to all three questions has not built GDPR compliance into their collection process. They have bolted on a privacy notice.
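One way to pressure-test the Article 17 answer is to ask how a deletion request would mechanically propagate through a delivered corpus. The sketch below assumes a CSV manifest with hypothetical `audio_file` and `speaker_id` columns; the point is that erasure is only tractable if that speaker-level mapping exists in the first place.

```python
# Sketch of speaker-level erasure against a delivered corpus.
# Assumes a CSV manifest with (hypothetical) columns: audio_file, speaker_id.
import csv
from pathlib import Path

def erase_speaker(manifest_path: Path, audio_root: Path, speaker_id: str) -> list[str]:
    """Delete all audio files attributed to speaker_id; return the erased file names."""
    erased = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["speaker_id"] == speaker_id:
                target = audio_root / row["audio_file"]
                if target.exists():
                    target.unlink()  # remove the recording itself
                erased.append(row["audio_file"])
    return erased

# A GDPR Article 17 request resolves to a pseudonymous ID, then to files:
# erase_speaker(Path("manifest.csv"), Path("corpus/"), speaker_id="spk-00421")
```

A real procedure would also rewrite the manifest, purge derived artifacts such as features or transcripts, and log the erasure for audit; the sketch shows only the tracing step that a speaker-level ID makes possible.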
4. Data lineage: chain-of-custody from speaker recruitment to delivery
Data lineage means the ability to trace every element of the delivered corpus back to its origin: how the speaker was recruited, what consent they provided, which annotator processed each segment, what QA pass it received, and when each step occurred.
This is not an administrative nicety. For EU AI Act Article 10 compliance, high-risk AI system providers are required to demonstrate data quality documentation. Chain-of-custody records are the basis for that documentation. A vendor who cannot provide per-segment lineage cannot support your compliance obligations.
Ask for a sample data manifest from a previously delivered project. It should link audio file identifiers to speaker demographic records, annotator IDs, QA gate outcomes, and consent reference numbers.
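For orientation, a per-segment manifest entry might carry fields like the following. The schema and field names are hypothetical, not any vendor's actual format; what matters is that every delivered segment links back to speaker, consent, annotator, and QA records.

```python
# Hypothetical per-segment manifest entry; field names are illustrative, not a standard.
from dataclasses import dataclass, fields

@dataclass
class ManifestEntry:
    audio_file: str    # e.g. "seg_000187.wav"
    speaker_id: str    # pseudonymous speaker reference
    consent_ref: str   # links to the signed consent record naming AI training
    dialect: str       # dialect-level label, not just language
    annotator_id: str  # who produced the transcription
    qa_gate: str       # outcome of the QA pass, e.g. "passed_blind_review"
    recorded_at: str   # ISO 8601 timestamp for chain-of-custody

def missing_fields(entry: ManifestEntry) -> list[str]:
    """Flag empty lineage fields: a quick completeness check on a sample manifest."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]

entry = ManifestEntry(
    audio_file="seg_000187.wav", speaker_id="spk-00421",
    consent_ref="consent-2024-0081", dialect="nb-NO-trondersk",  # illustrative tag
    annotator_id="ann-017", qa_gate="passed_blind_review",
    recorded_at="2024-11-03T14:22:00Z",
)
print(missing_fields(entry))  # [] when every lineage link is present
```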
5. Pilot availability
Any production-grade vendor offers a paid pilot before volume contracts. The pilot is where you learn whether their methodology works for your specific requirements; marketing materials cannot tell you that.
A vendor who resists piloting typically has one of two problems: their quality on challenging audio conditions is weaker than their benchmarks suggest, or their workflow cannot accommodate the evaluation overhead that a genuine pilot requires.
The pilot should be large enough for its quality metrics to be statistically meaningful (typically 5-10 hours of audio covering your most challenging conditions) and evaluated against your own quality benchmarks, not the vendor's.
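As a rough back-of-envelope on why a few hours is enough: at conversational speaking rates, 5-10 hours of audio contains tens of thousands of words, which pins down a measured error rate fairly tightly. The sketch below assumes roughly 140 spoken words per minute and a normal-approximation interval; it somewhat understates the true uncertainty, since transcription errors cluster rather than occurring independently.

```python
# Back-of-envelope: how tightly does a pilot of N hours estimate WER?
# Assumptions (illustrative): ~140 spoken words per minute, errors treated as
# independent Bernoulli trials, 95% normal-approximation interval.
import math

def wer_margin(hours: float, wer: float, words_per_minute: int = 140) -> float:
    """Approximate 95% confidence half-width on a measured WER."""
    n_words = hours * 60 * words_per_minute
    return 1.96 * math.sqrt(wer * (1 - wer) / n_words)

for h in (1, 5, 10):
    print(f"{h:>2} h pilot, measured WER 15%: +/- {wer_margin(h, 0.15):.2%}")
# At 5-10 hours the interval is well under half a percentage point, enough to
# separate vendors whose true error rates differ meaningfully.
```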
6. EU AI Act Article 10 compliance documentation
For any organization deploying a high-risk AI system under Annex III, the data vendor must be able to provide documentation sufficient to support a conformity assessment. This includes collection methodology documentation, demographic coverage reports, bias examination records, and version-controlled dataset specifications.
Ask vendors directly: “Can you provide documentation sufficient for an EU AI Act Article 10 conformity assessment?” A vendor with genuine compliance infrastructure will have a standard documentation package. A vendor without one will respond with qualifications and caveats.
Red flags: what bad vendors say vs. what they do
“Our data is 98% accurate.” Accuracy without a methodology is not a metric. Ask: 98% on what audio conditions? Evaluated by whom? Using which metric? On which language variant? If the vendor cannot answer these questions with specifics, the number is marketing.
No IAA reporting. A vendor who mentions quality control in broad terms but cannot produce IAA scores does not track agreement systematically. Their quality signal is self-reported, not measured.
Annotators based in the wrong language region. Vendors who describe their annotator workforce in terms of language count rather than dialect-level matching are routing audio to the wrong annotators. This is one of the most common sources of systematic labeling errors in multilingual datasets.
No GDPR consent documentation. A privacy policy on the vendor’s website is not consent documentation for your corpus. Ask for the specific consent form used with speakers and verify that it covers your intended use case.
Volume pricing without pilot availability. Vendors who push directly to volume pricing and resist pilot evaluation are optimizing for contract size, not data quality. Vendors confident in their data offer pilots because they know it holds up.
The pilot evaluation: what to test before signing a volume contract
Structure your pilot around the worst-case audio conditions in your production deployment, not the best.
Select audio that represents your most challenging requirements:
- The dialect with the widest phonemic distance from standard language varieties
- The highest background noise level your system will encounter
- The domain vocabulary with the highest out-of-vocabulary rate for general ASR models
- The fastest speaking rate in your target population
Evaluate the pilot output against three criteria: transcription accuracy using your reference labels, IAA scores across the annotators who processed the batch, and metadata completeness (speaker demographic fields, recording condition tags, annotator IDs).
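For the accuracy criterion, word error rate against your own reference labels is the standard measure. The following is a minimal self-contained implementation for spot-checking pilot transcripts (libraries such as `jiwer` compute the same quantity); treat it as a sketch, not a full evaluation harness.

```python
# Minimal word error rate (WER) for spot-checking pilot transcripts against
# your own reference labels. Standard edit-distance definition:
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("slå på lysene i stua", "slå på lyset i stua"))  # 0.2: one substitution
```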
Request the full annotation methodology documentation alongside the files. A vendor who delivers clean audio without accompanying methodology docs has not shown you their pipeline, only its output.
Questions to ask in the vendor RFP process
These questions distinguish vendors with real quality infrastructure from those with polished sales decks.
On annotator matching:
- How do you match annotators to specific dialects within a target language?
- What is the minimum proficiency requirement for annotators working on [target language variant]?
- Can you provide a breakdown of your annotator pool by language and dialect?
On quality control:
- What IAA metric do you use and what is your minimum acceptable threshold per batch?
- What triggers annotator retraining and how is that process documented?
- What percentage of each batch receives expert blind review?
On compliance:
- Can you provide a sample consent form used with speakers?
- Where are speaker recordings stored and processed geographically?
- What is your right-to-erasure process after corpus delivery?
On documentation:
- Can you provide a sample data manifest from a previous project?
- Do you produce a datasheet covering collection methodology, preprocessing, and known limitations?
- Can you support EU AI Act Article 10 conformity assessment documentation?
A vendor who answers all of these questions with specifics and evidence is worth a pilot. A vendor who deflects, qualifies, or responds with general statements is not.
How YPAI approaches vendor evaluation criteria
YPAI collects European speech corpora across 50+ EU dialects with annotator matching at the dialect level. Bokmål, Nynorsk, and regional Norwegian varieties are each handled by annotators selected for that specific language variant.
IAA is tracked per batch using documented methodologies, with thresholds that trigger annotator review before a batch proceeds. Every speaker provides consent that explicitly covers AI training as a use case. Data collection, storage, and processing are EEA-only, supervised under Norwegian data protection authority guidelines.
Chain-of-custody documentation is standard in every delivery: speaker recruitment records, consent reference numbers, annotator IDs, and QA gate outcomes linked at the per-segment level. Pilot availability is offered to all prospective clients before volume commitments. EU AI Act Article 10 documentation is available on request.
If you are mid-evaluation and want a reference against your procurement checklist, talk to our team or explore our speech data services.
YPAI Speech Data: Key Specifications
| Specification | Value |
|---|---|
| Verified EEA contributors | 20,000 |
| EU dialects covered | 50+ (with annotator matching per dialect) |
| Transcription IAA threshold | ≥ 0.80 Cohen’s kappa per batch |
| Data residency | EEA-only — no US sub-processors for raw audio |
| Synthetic data | None — 100% human-recorded |
| Consent standard | Explicit, purpose-specific, names AI training (GDPR Art. 6/9) |
| Erasure mechanism | Speaker-level IDs in all delivered datasets |
| Regulatory supervision | Datatilsynet (Norwegian data protection authority) |
| EU AI Act Article 10 docs | Available on request before contract signature |
Related articles
- Audio Annotation Pipeline for Speech Data Labeling
- EU AI Act High-Risk AI Training Data Requirements
- GDPR-Compliant Speech Data Collection in Europe