Speech Data Vendor RFP: Requirements Framework

What to specify in a speech data vendor RFP: language scope, quality thresholds, GDPR compliance requirements, delivery format, and evaluation criteria.

YPAI Engineering · 9 min read

Key Takeaways

  • A structured RFP forces vendors to be precise about methodology, not just volume. Vague requirements select for the lowest-cost provider, not the highest-quality one.
  • Language scope must name specific dialects. 'Norwegian' as a requirement is insufficient. Bokmal, Nynorsk, and regional variants each require separate specification.
  • Quality thresholds must be measurable: WER on a reference test set, IAA score above a defined floor, and a mandatory pilot before volume commitment.
  • GDPR compliance documentation is a procurement requirement, not a post-contract consideration. Ask for it before shortlisting.
  • Weighting price at more than 10% in a speech data RFP selects for quality shortcuts. The methodology and compliance sections should dominate the scoring.

A speech data RFP is not a volume order. It is a technical specification that determines whether the corpus you receive can actually train a production ASR system. Most enterprise procurement teams underspecify requirements and overfocus on price, which produces exactly the wrong outcome: a corpus that looks complete on paper and fails in deployment.

This framework covers what to specify in each section of a speech data vendor RFP, in enough detail that vendors cannot respond with vague claims about quality and coverage.

Why a structured RFP changes the vendor pool

An open-ended request attracts bulk providers. A requirements-based RFP filters for vendors with actual methodology.

When an RFP specifies that responses must include a documented IAA methodology, a speaker recruitment protocol for each target dialect, and a sample consent record, vendors who operate without those processes cannot respond credibly. That filtering happens before you read a single proposal. The shortlist arrives pre-qualified.

There is also a commercial benefit. Vendors who know your requirements upfront price to those requirements. Vendors who discover them mid-project request change orders. A detailed RFP reduces scope disputes.

Section 1: Scope definition

The scope section must be specific enough that two different vendors reading it produce comparable responses. Ambiguity here produces incomparable bids.

Languages and dialects

List each language variant separately. Do not write “Norwegian” if you need Bokmal and Nynorsk treated as distinct targets. Do not write “German” if you need Austrian and Swiss German alongside Hochdeutsch. Each variant should have its own line with a target volume, speaker count, and demographic requirements.

For European enterprise ASR, standard regional variants to specify include:

  • Norwegian: Bokmal (Eastern, Western), Nynorsk, Northern dialects
  • German: Standard (Hochdeutsch), Austrian, Swiss German
  • French: Standard (Parisian), Belgian, Canadian (if relevant)
  • Spanish: Castilian, Latin American if multi-region deployment is intended

A vendor who collapses these into a single language category is not capable of producing dialect-balanced data.
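To keep vendor responses comparable, the scope section can double as a machine-checkable specification. A minimal sketch in Python; the field names and volumes are illustrative assumptions, not a standard schema:

```python
# Illustrative scope specification: one entry per language variant,
# each with its own volume and speaker targets.
SCOPE = [
    {"language": "no", "variant": "Bokmal (Eastern)", "hours": 120, "speakers": 300},
    {"language": "no", "variant": "Nynorsk", "hours": 60, "speakers": 150},
    {"language": "de", "variant": "Hochdeutsch", "hours": 200, "speakers": 500},
    {"language": "de", "variant": "Swiss German", "hours": 80, "speakers": 200},
]

def validate_scope(scope):
    """Reject entries that collapse variants or omit per-variant targets."""
    for entry in scope:
        for field in ("language", "variant", "hours", "speakers"):
            if not entry.get(field):
                raise ValueError(f"scope entry missing {field}: {entry}")
    return True
```

The same structure can later be used to verify the delivered corpus against the agreed scope, variant by variant.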

Speaker demographics

Define the demographic profile for each language. Minimum requirements for a production corpus include age range distribution (at minimum: under 30, 30-60, over 60), gender balance, and L2 speaker percentage if the deployed system will encounter non-native speakers. Include the native language status requirement: whether L1 speakers only are acceptable, or whether L2 speakers with a specific language background are also in scope.

Recording conditions

List the acoustic environments required, not just the dominant use case. A voice assistant deployed on mobile devices needs recordings from mobile handsets in quiet rooms and mobile handsets in ambient noise. A call center ASR system needs telephone-quality recordings. An in-cabin automotive system needs recordings with engine noise and music. Specify each condition with its target volume.

Speaking styles and volume

Distinguish scripted read speech from prompted responses from spontaneous conversational speech. Each produces different acoustic characteristics and requires different collection and annotation approaches. Specify what percentage of total volume you need from each style.

For conversational ASR, a common split is 60-70% spontaneous and 30-40% scripted. Include this in the RFP so vendors allocate collection resources correctly.
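A delivery can be checked against the agreed style split with a few lines of code. A sketch using the 60-70% spontaneous band from the example above; the dictionary keys are illustrative:

```python
def check_style_split(hours_by_style, spontaneous_range=(0.60, 0.70)):
    """Verify delivered volume matches the RFP's speaking-style split.

    hours_by_style: e.g. {"spontaneous": 130, "scripted": 70}
    Returns the spontaneous share and whether it falls in the target band.
    """
    total = sum(hours_by_style.values())
    share = hours_by_style.get("spontaneous", 0) / total
    lo, hi = spontaneous_range
    return share, lo <= share <= hi
```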

Section 2: Quality requirements

Quality requirements are the most important and most frequently underspecified section of a speech data RFP. Vague quality language produces vendor responses with vague quality claims.

Transcription accuracy

Require vendors to demonstrate word error rate on a reference test set, not self-reported accuracy on their internal data. The vendor should provide a sample of transcribed audio from a previous project in the same language and recording conditions, and you should run your own WER evaluation before shortlisting. A WER above 5% on clean speech is a red flag for production use.
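WER is straightforward to compute yourself from the vendor's sample transcriptions and your own reference text. A minimal sketch using word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

In practice, normalize casing and punctuation in both strings before scoring, and report WER separately per recording condition rather than as a single aggregate.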

Inter-annotator agreement

Specify a minimum IAA score and require vendors to report it by language and by recording condition. A Cohen’s kappa of 0.80 or above is a reasonable minimum floor for production ASR annotation. Require the methodology: which annotation pairs were compared, how many samples were drawn, and how the score was computed.
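For a two-annotator setup, Cohen's kappa can be computed directly from the two label sequences. A minimal sketch (libraries such as scikit-learn provide equivalent implementations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```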

Single-annotator workflows without IAA measurement should be disqualified. Single annotation introduces systematic bias that is invisible until the model fails on specific speaker groups or recording conditions.

QA pass/fail thresholds

Ask vendors to describe what triggers re-annotation. A vendor with a documented QA gate, where a defined percentage of randomly sampled annotations failing a quality check triggers full re-annotation of that batch, is operating at production quality. A vendor who describes quality as “reviewed by a senior annotator” is not.
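The batch-level QA gate described above can be sketched in a few lines; the 10% sample and 5% failure threshold are illustrative values the RFP should replace with its own:

```python
import random

def qa_gate(batch, check, sample_frac=0.10, max_fail_rate=0.05, seed=0):
    """Randomly sample a batch of annotations; if the failure rate in the
    sample exceeds the threshold, the whole batch goes to re-annotation.
    Returns True if the batch passes."""
    rng = random.Random(seed)
    sample = rng.sample(batch, max(1, int(len(batch) * sample_frac)))
    failures = sum(1 for item in sample if not check(item))
    return failures / len(sample) <= max_fail_rate
```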

Style guide requirement

Require vendors to deliver their annotation style guide as part of the proposal response. The style guide is the specification for how annotators resolve ambiguous cases: how to transcribe disfluencies, how to handle overlapping speech, how to label non-lexical sounds. A vendor without a written style guide does not have consistent annotation. Read it before shortlisting.

Pilot requirement

Require a pilot before any volume commitment. The pilot should be a minimum of 50 utterances in the target language and recording conditions, annotated using the vendor’s production process. Evaluate the pilot output against your own reference transcriptions. The pilot pass threshold should be specified in the RFP so vendors know what they are being evaluated against before they respond.

Section 3: Compliance and data rights

Compliance requirements should be defined before the RFP goes out, not negotiated after award. Including them in the RFP surfaces vendors who cannot meet them early in the process.

Consent records

Require vendors to provide a sample consent record covering: what the speaker agreed to, the purpose of collection, the data controller identity, the retention period, and the process for exercising the right to erasure under GDPR Article 17. If the vendor cannot produce a sample consent record, they are not operating a GDPR-compliant collection process.
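A sample consent record might look like the sketch below. The field names and values are illustrative assumptions, not a legal template; the substance must come from counsel:

```python
# Illustrative consent record shape; field names are assumptions.
SAMPLE_CONSENT_RECORD = {
    "consent_id": "c-2024-000123",
    "speaker_id": "spk-0456",
    "purpose": "ASR training data collection",
    "data_controller": "Example Corp AS",
    "retention_period_months": 36,
    "erasure_process": "written request to controller; removal within 30 days",
    "legal_basis": "consent, GDPR Art. 6(1)(a)",
    "agreed_at": "2024-03-14T10:22:00Z",
}

REQUIRED_FIELDS = {"purpose", "data_controller",
                   "retention_period_months", "erasure_process"}

def consent_record_complete(record):
    """Check that a vendor's sample record covers the fields the RFP demands."""
    return REQUIRED_FIELDS.issubset(record)
```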

For any vendor collecting data from EEA speakers, also require a signed Data Processing Agreement (DPA) before award. The DPA must specify data residency and confirm that processing is performed within the EEA unless you have an explicit transfer mechanism in place.

Data residency

For EU enterprise deployment, specify EEA-only data residency unless you have a legal basis for cross-border transfers. This means collection, processing, storage, and delivery must all take place within EEA jurisdiction. Require vendors to confirm their processing infrastructure and confirm that no third-party subprocessor outside the EEA handles the data.

EU AI Act Article 10 documentation

If the voice AI system being trained falls under Annex III of the EU AI Act as a high-risk application, your training data procurement must support Article 10 data governance obligations. This means requiring vendors to document the demographic representativeness of their speaker population, the bias audit methodology applied to the corpus, and the consent documentation chain.

For more detail on which systems are high-risk and what Article 10 requires, see our guide to EU AI Act high-risk AI training data requirements.

Data ownership and licensing

Specify in the RFP whether you require full ownership of the delivered corpus or a usage license. Full ownership is standard for custom collection projects and allows unrestricted use across training runs, model versions, and derived products. Licensing arrangements are acceptable for some use cases but must define what is permitted: specific model versions, specific deployment regions, or time-limited usage.

Include a clause requiring vendors to document speaker erasure obligations. If a speaker exercises their right to erasure, you need to know which recordings are affected and have a process for removing them from training data and any derived models that were trained on them.
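Tracing an erasure request back to affected recordings is trivial if the delivery manifest links every file to a speaker ID. A minimal sketch, assuming a manifest of per-file metadata dicts:

```python
def recordings_for_speaker(manifest, speaker_id):
    """Return every audio file affected by a speaker's erasure request.

    manifest: list of dicts, each with at least 'file' and 'speaker_id'.
    """
    return [row["file"] for row in manifest if row["speaker_id"] == speaker_id]
```

Removing the files is the easy half; the RFP clause should also state what happens to models already trained on them.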

Section 4: Technical delivery requirements

Delivery format requirements are often left to the vendor when they should be specified by the buyer. A corpus delivered in the wrong format, with incomplete metadata, costs weeks of post-processing to make usable.

Audio format

Specify the required format. For ASR training, WAV (PCM, 16-bit) at 16 kHz mono is a common standard for telephone and mobile audio. Higher sample rates (24 kHz or 48 kHz) may be appropriate for high-fidelity speech synthesis or voice biometrics. FLAC is acceptable where lossless compression is needed to reduce transfer size. Do not accept MP3 or other lossy formats for ASR training data.
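Format conformance is cheap to verify at delivery time. A sketch using Python's standard wave module to check for 16 kHz, mono, 16-bit PCM:

```python
import wave

def check_wav_format(path, sample_rate=16000, channels=1, sample_width_bytes=2):
    """Verify a delivered WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == sample_rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sample_width_bytes)
```

Run this over every file in a delivery and reject the batch on any mismatch, rather than discovering resampled or stereo files mid-training.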

Metadata schema

Specify the required metadata fields, not just the audio files. A minimum schema for production speech data includes: speaker ID, language and dialect, age range, gender, native language status, recording environment, microphone type, sample rate, and transcription confidence score. For compliance use cases, add consent record reference ID, collection date, and geographic region.

Require the metadata to be delivered in a machine-readable format (JSON or CSV) and to be field-indexed so you can filter by any attribute. A corpus delivered as a flat archive without structured metadata is not usable for production training.
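The minimum schema above can be enforced with a simple completeness check and filter utility. A sketch, assuming metadata rows have been parsed into dicts from the delivered JSON or CSV:

```python
REQUIRED_METADATA_FIELDS = [
    "speaker_id", "language", "dialect", "age_range", "gender",
    "native_language_status", "recording_environment", "microphone_type",
    "sample_rate", "transcription_confidence",
]

def missing_fields(row):
    """List any required metadata fields absent from a row."""
    return [f for f in REQUIRED_METADATA_FIELDS if f not in row]

def filter_corpus(rows, **criteria):
    """Filter metadata rows by any field, e.g. dialect='Nynorsk'."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
```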

Versioning and reproducibility

Require vendors to provide versioned delivery manifests. Each delivery should have a manifest file that lists every audio file, its metadata, the annotation version applied, and a hash for integrity verification. This is the minimum requirement for reproducibility: the ability to trace which data went into which training run and reproduce any historical model if needed.
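A versioned manifest with integrity hashes can be generated with the standard library alone. A minimal sketch:

```python
import hashlib
import json

def build_manifest(files, annotation_version):
    """Build a versioned delivery manifest: one entry per audio file with
    its metadata, the annotation version applied, and a SHA-256 hash for
    integrity verification.

    files: list of (path, metadata_dict) pairs.
    """
    entries = []
    for path, metadata in files:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        entries.append({"file": path, "metadata": metadata,
                        "annotation_version": annotation_version,
                        "sha256": digest})
    return json.dumps({"entries": entries}, indent=2)
```

Recomputing the hashes on receipt catches corrupted or silently re-exported audio before it enters a training run.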

Section 5: Evaluation criteria and scoring

The scoring weights in a speech data RFP determine which vendors win. Setting price as a primary criterion selects for low-cost, low-quality providers. The framework below weights what actually matters for production ASR quality:

  • Quality methodology (IAA, QA gates, human review process): 35%
  • Compliance documentation (GDPR, EU AI Act, data residency): 25%
  • Language and dialect coverage: 20%
  • Timeline and delivery logistics: 10%
  • Price: 10%

This weighting reflects what determines whether a corpus is usable for production training, rather than what is easiest to compare across proposals. A vendor who scores 90% on quality and compliance at 20% higher price will produce a corpus that actually works. A vendor who wins on price at the cost of quality and compliance will produce a corpus that requires re-collection.
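As a sanity check, the combined score for a proposal follows directly from the suggested weights. A sketch; the criterion keys are illustrative:

```python
# Suggested RFP scoring weights; must sum to 1.0.
WEIGHTS = {
    "quality_methodology": 0.35,
    "compliance_documentation": 0.25,
    "language_dialect_coverage": 0.20,
    "timeline_delivery": 0.10,
    "price": 0.10,
}

def weighted_score(scores):
    """Combine per-criterion scores (0-100) using the RFP weights."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

With quality 90, compliance 90, coverage 80, timeline 70, and price 50, a vendor still scores 82 overall, which is the point of the weighting: strong methodology outweighs a weak price score.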

Include a pilot evaluation gate in the scoring process. Vendors who advance past the written proposal stage should complete the required pilot and be scored on pilot output quality before final award.

YPAI

YPAI collects European speech corpora across Nordic and major EU languages using a verified contributor network of 20,000 speakers operating exclusively within the EEA. Our collection process is GDPR-native: explicit consent per use case, right-to-erasure-ready speaker records, and Datatilsynet-supervised data processing.

Our Nordic coverage includes Bokmal, Nynorsk, and regional dialect variants. All transcriptions receive human verification, and our annotation pipeline uses IAA tracking at every stage. For enterprises building AI systems under EU AI Act Annex III, our corpus documentation is designed to support Article 10 compliance.

If you are writing a speech data RFP and want to discuss how to specify your requirements before the document goes out, contact our data team.




Frequently Asked Questions

What is a reasonable pilot size before committing to a speech data collection contract?
A minimum of 50 utterances is the floor for evaluating annotation quality. For dialect-specific work or technically demanding domains, 100-200 utterances gives a more reliable signal. The pilot should include your hardest audio conditions: spontaneous speech, regional accents, and noisy recording environments. Do not run a pilot only on clean, scripted recordings.
How specific do language requirements need to be in a speech data RFP?
More specific than most procurement teams write them. 'Norwegian' does not tell a vendor whether you need Bokmal, Nynorsk, eastern dialects, western dialects, or all of the above. 'Spanish' does not specify whether you need Castilian, Latin American variants, or both. Each dialect or regional variant should be listed separately with its own volume target and speaker quota.
What is inter-annotator agreement and what score should the RFP require?
Inter-annotator agreement (IAA) measures how consistently different annotators label the same audio segment. A Cohen's kappa score above 0.80 is a reasonable minimum for production speech annotation. Scores below 0.70 indicate that annotation guidelines are ambiguous or that annotator training is insufficient. Require vendors to report IAA by language and by recording condition, not as a single aggregate number.
Who owns the speech data corpus after delivery?
This is a contract term, not a technical specification, but it belongs in the RFP because the answer determines the pricing model. Full ownership of the delivered corpus is the standard for custom collection projects. Licensing arrangements (where the vendor retains ownership and grants usage rights) typically cost less upfront but restrict what you can do with the data. Clarify this in the RFP so vendors price accordingly.

Sourcing Speech Data for an Enterprise ASR Project?

YPAI provides European-sovereign, GDPR-compliant speech corpora across Nordic and major EU languages. Our team can help you define requirements before the RFP goes out.