Key Takeaways
- A structured RFP forces vendors to be precise about methodology, not just volume. Vague requirements select for the lowest-cost provider, not the highest-quality one.
- Language scope must name specific dialects. “Norwegian” as a requirement is insufficient: Bokmål, Nynorsk, and regional variants each require separate specification.
- Quality thresholds must be measurable: WER on a reference test set, IAA score above a defined floor, and a mandatory pilot before volume commitment.
- GDPR compliance documentation is a procurement requirement, not a post-contract consideration. Ask for it before shortlisting.
- Weighting price at more than 10% in a speech data RFP selects for quality shortcuts. The methodology and compliance sections should dominate the scoring.
A speech data RFP is not a volume order. It is a technical specification that determines whether the corpus you receive can actually train a production ASR system. Most enterprise procurement teams underspecify requirements and overfocus on price, which produces exactly the wrong outcome: a corpus that looks complete on paper and fails in deployment.
This framework covers what to specify in each section of a speech data vendor RFP, in enough detail that vendors cannot respond with vague claims about quality and coverage.
Why a structured RFP changes the vendor pool
An open-ended request attracts bulk providers. A requirements-based RFP filters for vendors with actual methodology.
When an RFP specifies that responses must include a documented IAA methodology, a speaker recruitment protocol for each target dialect, and a sample consent record, vendors who operate without those processes cannot respond credibly. That filtering happens before you read a single proposal. The shortlist arrives pre-qualified.
There is also a commercial benefit. Vendors who know your requirements upfront price to those requirements. Vendors who discover them mid-project request change orders. A detailed RFP reduces scope disputes.
Section 1: Scope definition
The scope section must be specific enough that two different vendors reading it produce comparable responses. Ambiguity here produces incomparable bids.
Languages and dialects
List each language variant separately. Do not write “Norwegian” if you need Bokmål and Nynorsk treated as distinct targets. Do not write “German” if you need Austrian and Swiss German alongside Hochdeutsch. Each variant should have its own line with a target volume, speaker count, and demographic requirements.
For European enterprise ASR, standard regional variants to specify include:
- Norwegian: Bokmål (Eastern, Western), Nynorsk, Northern dialects
- German: Standard (Hochdeutsch), Austrian, Swiss German
- French: Standard (Parisian), Belgian, Canadian (if relevant)
- Spanish: Castilian, Latin American (if multi-region deployment is intended)
A vendor who collapses these into a single language category is not capable of producing dialect-balanced data.
Speaker demographics
Define the demographic profile for each language. Minimum requirements for a production corpus include age range distribution (at minimum: under 30, 30-60, over 60), gender balance, and L2 speaker percentage if the deployed system will encounter non-native speakers. Include the native language status requirement: whether L1 speakers only are acceptable, or whether L2 speakers with a specific language background are also in scope.
Recording conditions
List the acoustic environments required, not just the dominant use case. A voice assistant deployed on mobile devices needs recordings from mobile handsets in quiet rooms and mobile handsets in ambient noise. A call center ASR system needs telephone-quality recordings. An in-cabin automotive system needs recordings with engine noise and music. Specify each condition with its target volume.
Speaking styles and volume
Distinguish scripted read speech from prompted responses from spontaneous conversational speech. Each produces different acoustic characteristics and requires different collection and annotation approaches. Specify what percentage of total volume you need from each style.
For conversational ASR, a common split is 60-70% spontaneous and 30-40% scripted. Include this in the RFP so vendors allocate collection resources correctly.
Section 2: Quality requirements
Quality requirements are the most important and most frequently underspecified section of a speech data RFP. Vague quality language produces vendor responses with vague quality claims.
Transcription accuracy
Require vendors to demonstrate word error rate on a reference test set, not self-reported accuracy on their internal data. The vendor should provide a sample of transcribed audio from a previous project in the same language and recording conditions, and you should run your own WER evaluation before shortlisting. A WER above 5% on clean speech is a red flag for production use.
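Running your own WER evaluation does not require vendor tooling. A minimal sketch of word-level WER via edit distance follows; whitespace tokenization is an assumption here, and a real evaluation also needs text normalization (casing, punctuation, number formatting) agreed in the RFP:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (one DP row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, d[j] = d[j], min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev_diag + cost,  # substitution, or match when cost is 0
            )
    return d[len(hyp)] / max(len(ref), 1)
```

Applied to a vendor sample: `wer("the cat sat on the mat", "the cat sat on a mat")` is one substitution over six reference words, roughly 0.167, well above the 5% clean-speech threshold.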
Inter-annotator agreement
Specify a minimum IAA score and require vendors to report it by language and by recording condition. A Cohen’s kappa of 0.80 or above is a reasonable floor for production ASR annotation. Require the methodology: which annotation pairs were compared, how many samples were drawn, and how the score was computed.
Single-annotator workflows without IAA measurement should be disqualified. Single annotation introduces systematic bias that is invisible until the model fails on specific speaker groups or recording conditions.
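To verify a reported kappa, you can recompute it from the vendor's raw annotation pairs. A minimal sketch for two annotators over categorical labels (the simple case; chance agreement here assumes independent annotators):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1
```

Note this handles categorical labels only; agreement on free-text transcriptions is usually measured differently (e.g. pairwise WER between annotators), so require the vendor to state which variant their reported score uses.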
QA pass/fail thresholds
Ask vendors to describe what triggers re-annotation. A vendor with a documented QA gate, where a defined percentage of randomly sampled annotations failing a quality check triggers full re-annotation of that batch, is operating at production quality. A vendor who describes quality as “reviewed by a senior annotator” is not.
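The QA gate a credible vendor describes amounts to simple acceptance sampling. A sketch of that logic, with illustrative thresholds (the 10% sample and 5% failure rate below are example values, not standards; your RFP should state the actual numbers you require):

```python
import random

def qa_gate(batch: list, check, sample_frac=0.10, max_fail_rate=0.05, seed=0) -> bool:
    """Sample a fraction of a batch of annotations and apply a quality check.
    Returns True if the batch passes; False means full re-annotation.
    Thresholds are illustrative, not industry standards."""
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    k = max(1, int(len(batch) * sample_frac))
    sample = rng.sample(batch, k)
    failures = sum(1 for item in sample if not check(item))
    return failures / k <= max_fail_rate
```

Asking a vendor to show the equivalent of this function in their own process documentation, with their actual thresholds, is a fast way to distinguish a documented gate from "reviewed by a senior annotator."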
Style guide requirement
Require vendors to deliver their annotation style guide as part of the proposal response. The style guide is the specification for how annotators resolve ambiguous cases: how to transcribe disfluencies, how to handle overlapping speech, how to label non-lexical sounds. A vendor without a written style guide does not have consistent annotation. Read it before shortlisting.
Pilot requirement
Require a pilot before any volume commitment. The pilot should be a minimum of 50 utterances in the target language and recording conditions, annotated using the vendor’s production process. Evaluate the pilot output against your own reference transcriptions. The pilot pass threshold should be specified in the RFP so vendors know what they are being evaluated against before they respond.
Section 3: Compliance and legal requirements
Compliance requirements should be defined before the RFP goes out, not negotiated after award. Including them in the RFP surfaces vendors who cannot meet them early in the process.
GDPR consent framework
Require vendors to provide a sample consent record covering: what the speaker agreed to, the purpose of collection, the data controller identity, the retention period, and the process for exercising the right to erasure under GDPR Article 17. If the vendor cannot produce a sample consent record, they are not operating a GDPR-compliant collection process.
For any vendor collecting data from EEA speakers, also require a signed Data Processing Agreement (DPA) before award. The DPA must specify data residency and confirm that processing is performed within the EEA unless you have an explicit transfer mechanism in place.
Data residency
For EU enterprise deployment, specify EEA-only data residency unless you have a legal basis for cross-border transfers. This means collection, processing, storage, and delivery must all take place within EEA jurisdiction. Require vendors to confirm their processing infrastructure and confirm that no third-party subprocessor outside the EEA handles the data.
EU AI Act Article 10 documentation
If the voice AI system being trained falls under Annex III of the EU AI Act as a high-risk application, your training data procurement must support Article 10 data governance obligations. This means requiring vendors to document the demographic representativeness of their speaker population, the bias audit methodology applied to the corpus, and the consent documentation chain.
For more detail on which systems are high-risk and what Article 10 requires, see our guide to EU AI Act high-risk AI training data requirements.
Data ownership and licensing
Specify in the RFP whether you require full ownership of the delivered corpus or a usage license. Full ownership is standard for custom collection projects and allows unrestricted use across training runs, model versions, and derived products. Licensing arrangements are acceptable for some use cases but must define what is permitted: specific model versions, specific deployment regions, or time-limited usage.
Include a clause requiring vendors to document speaker erasure obligations. If a speaker exercises their right to erasure, you need to know which recordings are affected and have a process for removing them from training data and any derived models that were trained on them.
Section 4: Technical delivery requirements
Delivery format requirements are often left to the vendor when they should be specified by the buyer. A corpus delivered in the wrong format, with incomplete metadata, costs weeks of post-processing to make usable.
Audio format
Specify the required format. For ASR training, WAV (PCM, 16-bit) at 16 kHz mono is a common standard for telephone and mobile audio. Higher sample rates (24 kHz or 48 kHz) may be appropriate for high-fidelity speech synthesis or voice biometrics. FLAC is acceptable where lossless compression is needed to reduce transfer size. Do not accept MP3 or other lossy formats for ASR training data.
Metadata schema
Specify the required metadata fields, not just the audio files. A minimum schema for production speech data includes: speaker ID, language and dialect, age range, gender, native language status, recording environment, microphone type, sample rate, and transcription confidence score. For compliance use cases, add consent record reference ID, collection date, and geographic region.
Require the metadata to be delivered in a machine-readable format (JSON or CSV) and to be field-indexed so you can filter by any attribute. A corpus delivered as a flat archive without structured metadata is not usable for production training.
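As a concreteness check during proposal review, the delivered metadata should support filtering like the sketch below. The field names are an example schema, not a standard; align them with whatever your RFP specifies:

```python
# Illustrative metadata record for one utterance; field names and codes
# (e.g. "nb-NO", "mobile_ambient") are assumptions for this example.
record = {
    "file": "spk0421_utt0007.wav",
    "speaker_id": "spk0421",
    "language": "nb-NO",
    "dialect": "eastern",
    "age_range": "30-60",
    "gender": "f",
    "native_language": "L1",
    "environment": "mobile_ambient",
    "sample_rate_hz": 16000,
    "consent_ref": "consent-2024-0421",
}

def filter_records(records: list, **criteria) -> list:
    """Return records matching every given field=value criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]
```

If you cannot write a one-line filter like `filter_records(corpus, dialect="eastern", environment="mobile_ambient")` against the delivery, the metadata is not field-indexed in any useful sense.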
Versioning and reproducibility
Require vendors to provide versioned delivery manifests. Each delivery should have a manifest file that lists every audio file, its metadata, the annotation version applied, and a hash for integrity verification. This is the minimum requirement for reproducibility: the ability to trace which data went into which training run and reproduce any historical model if needed.
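A manifest meeting this minimum requirement is simple to produce, which is why its absence is telling. A sketch, assuming WAV delivery and an illustrative manifest layout (the field names are examples, not a standard):

```python
import hashlib
import json
from pathlib import Path

def build_manifest(audio_dir: str, annotation_version: str,
                   out_path: str = "manifest.json") -> dict:
    """Write a delivery manifest: one entry per audio file with a SHA-256
    hash for integrity verification. Field names are illustrative."""
    entries = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        entries.append({
            "file": wav.name,
            "sha256": hashlib.sha256(wav.read_bytes()).hexdigest(),
            "annotation_version": annotation_version,
        })
    manifest = {"annotation_version": annotation_version, "files": entries}
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

On receipt, recomputing the hashes against the manifest verifies that the delivery is complete and untampered before it enters your training pipeline.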
Section 5: Evaluation criteria and scoring
The scoring weights in a speech data RFP determine which vendors win. Setting price as a primary criterion selects for low-cost, low-quality providers. A scoring framework that reflects what actually matters for production ASR quality:
| Criterion | Suggested weight |
|---|---|
| Quality methodology (IAA, QA gates, human review process) | 35% |
| Compliance documentation (GDPR, EU AI Act, data residency) | 25% |
| Language and dialect coverage | 20% |
| Timeline and delivery logistics | 10% |
| Price | 10% |
This weighting reflects what determines whether a corpus is usable for production training, rather than what is easiest to compare across proposals. A vendor who scores 90% on quality and compliance at 20% higher price will produce a corpus that actually works. A vendor who wins on price at the cost of quality and compliance will produce a corpus that requires re-collection.
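The trade-off in the table is easy to make concrete. A sketch of the weighted scoring, using the suggested weights and hypothetical per-criterion scores on a 0-100 scale:

```python
# Suggested weights from the table above; criterion keys are illustrative.
WEIGHTS = {
    "quality_methodology": 0.35,
    "compliance": 0.25,
    "language_coverage": 0.20,
    "timeline": 0.10,
    "price": 0.10,
}

def score_vendor(scores: dict) -> float:
    """Weighted total from per-criterion scores on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
```

Under these weights, a vendor scoring 90 on methodology and compliance but only 70 on price outscores a cheap vendor scoring 50 on both, because price can move the total by at most 10 points.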
Include a pilot evaluation gate in the scoring process. Vendors who advance past the written proposal stage should complete the required pilot and be scored on pilot output quality before final award.
YPAI
YPAI collects European speech corpora across Nordic and major EU languages using a verified contributor network of 20,000 speakers operating exclusively within the EEA. Our collection process is GDPR-native: explicit consent per use case, right-to-erasure-ready speaker records, and Datatilsynet-supervised data processing.
Our Nordic coverage includes Bokmål, Nynorsk, and regional dialect variants. All transcriptions receive human verification, and our annotation pipeline uses IAA tracking at every stage. For enterprises building AI systems under EU AI Act Annex III, our corpus documentation is designed to support Article 10 compliance.
If you are writing a speech data RFP and want to discuss how to specify your requirements before the document goes out, contact our data team.
Related articles
- Audio annotation pipeline for speech data labeling - how human-verified annotation pipelines are built and where they fail
- EU AI Act high-risk AI training data requirements - what Article 10 means for speech data procurement under Annex III
- Speech corpus collection services for enterprise ASR - what separates production-grade corpus collection from bulk audio
- Custom speech corpus collection
- GDPR-compliant speech data
- Evaluation program