Key Takeaways
- Vendor scorecards replace capability claims with measurable evidence: vendors score only on what they can document, not what they assert
- Compliance readiness (25 points) and language coverage (25 points) are the highest-weight categories because gaps in either produce outright deployment failure rather than degraded performance
- Data quality metrics must be corpus-specific, not presented as general vendor capability benchmarks
- Disqualification criteria exist separately from scoring: any vendor who fails a disqualification criterion is removed from evaluation regardless of scorecard total
- The scorecard process surfaces accountability: vendors who cannot produce corpus-specific documentation in response to scorecard questions will not be able to produce it for regulators
Procurement teams evaluating speech data vendors typically compare vendors on a combination of price, audio hour volume, and language coverage claims. This approach produces selection decisions based on vendor assertions rather than vendor evidence. A structured scorecard replaces assertion-based evaluation with documentation-based scoring: vendors score only on what they can demonstrate, not on what they claim.
The framework below assigns 100 points across five categories, weighted to reflect the deployment consequences of gaps in each category for EU enterprise AI systems requiring EU AI Act Article 10 compliance.
Category 1: Data Quality (20 points)
Data quality is scored on verifiable metrics from the specific corpus being delivered, not on the vendor’s general quality processes.
Inter-annotator agreement score (8 points). Request IAA scores for transcription on the specific corpus being evaluated. IAA measures annotation consistency: the degree to which independent annotators agree on the correct transcription of each utterance. A vendor who cannot provide this score for the delivered corpus has not measured annotation quality at the level that production training data requires. Scoring: 8 points for IAA of 0.90 or above; 6 points for IAA of 0.85 to 0.89; 3 points for IAA of 0.80 to 0.84; 0 points for IAA below 0.80 or not available.
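To make the IAA question concrete when reviewing a vendor's sample, a minimal sketch of pairwise agreement is shown below. It uses utterance-level exact-match agreement after light normalization, which is only one simple proxy; vendors may report chance-corrected measures such as Cohen's kappa or Krippendorff's alpha, or WER between annotators, so confirm which measure the reported score uses. The annotator names and transcriptions here are illustrative.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean utterance-level exact-match agreement across all annotator pairs.

    annotations: dict mapping annotator name -> list of transcriptions,
    aligned by utterance index. A simple IAA proxy, not a chance-corrected
    measure.
    """
    def norm(text):
        # Normalize case and whitespace so trivial differences don't count
        return " ".join(text.lower().split())

    scores = []
    for a, b in combinations(annotations.values(), 2):
        matches = sum(norm(x) == norm(y) for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Illustrative three-annotator sample
iaa = pairwise_agreement({
    "ann1": ["hello world", "good morning", "thank you"],
    "ann2": ["Hello world", "good morning", "thank  you"],
    "ann3": ["hello world", "goodmorning", "thank you"],
})
```

On this toy sample the agreement comes out near 0.78, which would fall in the 0-point band of the scorecard; a real evaluation would run over the full 100-utterance sample batch.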
QA process and rejection rate (6 points). Request the vendor’s acceptance criteria for audio inclusion and their rejection rate for the corpus component being delivered. A rejection rate of 5 to 15% indicates the vendor is applying meaningful quality gates. A rejection rate below 2% suggests quality gates are minimal or not applied consistently. Scoring: 6 points for documented numerical criteria with 5 to 15% rejection rate; 4 points for documented criteria with rejection rate outside the target range; 2 points for documented criteria without rejection rate data; 0 points for no documented quality acceptance criteria.
Sample audio review (6 points). Request 100 representative utterances from the corpus with their transcriptions. Review for transcription accuracy, audio quality, and demographic representativeness. Procurement teams that skip sample review cannot verify vendor claims about corpus characteristics. Scoring: 6 points for no transcription errors in sample with audio quality consistent with stated conditions and demographic balance evident; 4 points for minor transcription errors or quality inconsistencies; 2 points for significant transcription errors or quality concerns; 0 points for vendor refusing to provide a sample.
Category 2: Compliance Readiness (25 points)
Compliance scoring uses the EU AI Act Article 10 documentation requirements as the reference standard.
Individual consent records (10 points). Can the vendor produce a sample individual consent record for a contributor in the corpus? The consent record should identify the contributor, the date of consent, the scope of use consented to, and the right-to-erasure procedure. Aggregate terms-of-service acceptance is not an individual consent record. Scoring: 10 points for a sample consent record provided with all required elements; 7 points for consent records that exist but are missing some required elements; 3 points for aggregated consent rather than individual records; 0 points for no consent records or vendor refusal to disclose the consent framework.
GDPR legal basis documentation (8 points). Can the vendor articulate the specific GDPR legal basis for each component of the corpus? For voice data processed as biometric data, Article 9 applies, so a lawful basis requires both an Article 6 ground and an Article 9(2) exception (typically explicit consent under Article 9(2)(a)). Scoring: 8 points for a specific Article basis stated per corpus component with documentation; 5 points for a general GDPR compliance statement without component-level detail; 2 points for the vendor referencing GDPR compliance without specifying applicable articles; 0 points for the vendor being unable to articulate the GDPR legal basis.
Bias examination report (7 points). Has the vendor conducted a formal bias examination specific to this corpus? The examination should document specific demographic groups evaluated, fairness metrics applied, findings, and any mitigation steps. A general bias policy statement is not a bias examination report. Scoring: 7 points for a corpus-specific report with demographic groups, metrics, findings, and mitigations documented; 5 points for a corpus-specific analysis without complete documentation; 2 points for a general bias policy without corpus-specific analysis; 0 points for no bias examination or vendor refusal to provide one.
Category 3: Language and Demographic Coverage (25 points)
Coverage scoring evaluates match between the vendor’s actual corpus characteristics and the deployment’s target population.
Target language coverage depth (10 points). Does the vendor have documented collection experience in each of your target languages, with dialect coverage beyond standard broadcast speech? Scoring: 10 points for all target languages documented with dialect coverage data; 7 points for all target languages present but dialect coverage incomplete; 4 points for some target languages well-covered and others minimal; 0 points for one or more target languages absent from the vendor’s documented collection experience.
Demographic composition match (8 points). Can the vendor provide demographic breakdowns for the corpus by age group, gender, and regional origin? Does the composition match your deployment target population? Scoring: 8 points for a complete demographic breakdown with composition aligned to the deployment target; 6 points for a demographic breakdown available but alignment requires supplemental collection; 3 points for partial demographic data available; 0 points for no demographic breakdown available.
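One way to turn "does the composition match" into a reviewable number is a distance between the vendor's reported shares and the deployment target. The sketch below uses total variation distance; the metric choice, group labels, and percentages are all assumptions for illustration, not part of the scorecard itself.

```python
def composition_gap(corpus_pct, target_pct):
    """Total variation distance between corpus and target demographic shares.

    Both dicts map group label -> share (each summing to 1.0).
    0.0 = perfect match; 1.0 = completely disjoint populations.
    """
    groups = set(corpus_pct) | set(target_pct)
    return 0.5 * sum(abs(corpus_pct.get(g, 0.0) - target_pct.get(g, 0.0))
                     for g in groups)

# Illustrative age-group breakdown: vendor corpus vs. deployment target
gap = composition_gap(
    {"18-29": 0.30, "30-49": 0.45, "50+": 0.25},
    {"18-29": 0.25, "30-49": 0.40, "50+": 0.35},
)
```

A gap of 0.10 here means 10% of the corpus would need to shift between groups to match the target; what threshold separates "aligned" from "requires supplemental collection" is a judgment the buyer makes per deployment.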
Acoustic condition coverage (7 points). Does the corpus include audio collected under the acoustic conditions relevant to the deployment (telephony, in-vehicle, far-field, mobile)? Scoring: 7 points for acoustic conditions documented and aligned with deployment conditions; 5 points for relevant conditions present but not fully specified; 2 points for primarily clean studio audio without deployment-relevant acoustic conditions; 0 points for acoustic conditions not documented.
Category 4: Documentation Standards (20 points)
Documentation is evaluated against EU AI Act Article 10’s requirement for collection methodology documentation, preprocessing documentation, and data lineage.
Collection methodology document (8 points). Request a collection methodology document specific to the corpus being evaluated. The document should describe contributor recruitment criteria, recording protocols, quality acceptance criteria, and annotation procedures. A general vendor methodology document that applies across all corpora is not corpus-specific documentation. Scoring: 8 points for a corpus-specific methodology document with all required sections; 5 points for a corpus-specific document missing some sections; 2 points for a general vendor methodology document only; 0 points for no written methodology documentation.
Data lineage and preprocessing logs (7 points). Can the vendor provide documentation of all preprocessing and transformation steps applied to the corpus? This includes normalization, segmentation, noise reduction, and annotation post-processing. Scoring: 7 points for a complete preprocessing log with steps and parameters documented; 5 points for partial preprocessing documentation; 2 points for preprocessing acknowledged but not documented; 0 points for no preprocessing documentation.
Right-to-erasure SLA (5 points). Does the vendor have a documented SLA for responding to GDPR Article 17 right-to-erasure requests from contributors? For a corpus with identified contributors, this SLA governs how quickly the vendor can identify and remove a contributor’s recordings. Scoring: 5 points for a documented SLA with specific response time and process; 3 points for an erasure process acknowledged but not documented with SLA; 0 points for no documented erasure process.
Category 5: Delivery and SLA (10 points)
Format specification compliance (5 points). Can the vendor deliver the corpus in the format specified in the RFP without requiring post-delivery format conversion? Conversion after delivery is an integration cost the contract does not cover. Scoring: 5 points for format specification accepted and documented in contract; 3 points for format partially compliant with minor conversion required; 0 points for format conversion required.
Delivery timeline commitment (5 points). Has the vendor provided a milestone-based delivery schedule with specific dates, intermediate deliveries, and acceptance criteria for each milestone? Scoring: 5 points for a detailed milestone schedule with intermediate deliveries and acceptance gates; 3 points for an overall delivery date without milestones; 0 points for a timeline not specified or contingent on unspecified factors.
Disqualification criteria
The following criteria result in removal from evaluation regardless of scorecard total:
- Vendor cannot produce individual consent records for contributors (aggregate terms-of-service acceptance does not qualify)
- Vendor’s legal entity is incorporated outside the EEA and has no documented transfer mechanism for contributor data to the buyer’s jurisdiction
- Vendor cannot provide a sample audio batch with transcriptions for pre-contract review
- Vendor’s parent company or controlling entity is subject to US CLOUD Act or equivalent foreign government data access law and the vendor cannot document how this risk is mitigated
- Total bias examination section score is 0 (no examination exists for the corpus)
A vendor who triggers any disqualification criterion is not evaluable under this framework regardless of performance in other categories.
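The two-stage logic above — check disqualification criteria first, score categories only for vendors that pass — can be sketched as follows. Category names, criterion keys, and point values are illustrative, not prescribed by the framework.

```python
def evaluate_vendor(category_scores, disqualifiers):
    """Apply the scorecard's two-stage logic.

    disqualifiers: dict of criterion name -> bool (True = triggered).
    Disqualification is checked before scoring, so a disqualified
    vendor never receives a total.
    category_scores: dict of category name -> points awarded.
    """
    triggered = [name for name, hit in disqualifiers.items() if hit]
    if triggered:
        return {"status": "disqualified", "criteria": triggered, "total": None}
    return {"status": "scored", "total": sum(category_scores.values())}

# Illustrative vendor that passes all disqualification checks
result = evaluate_vendor(
    {"data_quality": 16, "compliance": 22, "coverage": 18,
     "documentation": 15, "delivery": 8},
    {"no_individual_consent": False, "no_transfer_mechanism": False,
     "no_sample_batch": False, "cloud_act_unmitigated": False,
     "no_bias_examination": False},
)
```

Keeping the gate outside the scoring sum mirrors the framework's intent: a vendor cannot buy back a consent or sovereignty failure with strong scores elsewhere.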
For the pre-contract questions this scorecard is derived from, see our speech data vendor due diligence guide. For the RFP structure that precedes scorecard evaluation, see our speech data vendor RFP requirements guide.
Related Resources
- Speech data vendor due diligence: 12 questions - Pre-contract questions mapped to the same four risk categories
- EU AI Act Article 10: What Speech Data Vendors Must Prove to Enterprise Buyers - Documentation requirements that define the compliance readiness category
- Custom speech corpus TCO vs off-the-shelf datasets - TCO analysis that precedes vendor selection
- EU speech data sovereignty: why GDPR is not enough - Why the disqualification criteria include sovereignty checks
- Speech data overview
- EU AI Act compliant training data
- Data processing agreement overview