Speech Datasets

Pre-Built Speech Corpora for Faster Model Development

Ready-to-use multilingual speech datasets with full provenance documentation. Skip the 6-month collection cycle. Start training today.

100,000+ Hours Cataloged
50+ European Dialects
20,000+ Verified Contributors
GDPR-Native Consent

Dataset Categories

Six families of production-ready speech data

European Languages

Norwegian, Swedish, Danish, Finnish, German, French, Spanish, Italian, Dutch, Polish. Read, spontaneous, and conversational speech across major EU languages.

Nordic Dialects

Deep dialect coverage. Norwegian: Bergen, Oslo, Stavanger, Trondheim, and Northern Norwegian. Swedish: Stockholm, Gothenburg, and Skåne.

Automotive

In-vehicle recordings with noise-condition metadata. Highway, urban, and idle states across multiple European languages.

Healthcare

Clinical terminology and patient-provider dialogue (anonymized). HIPAA-compatible processing with full de-identification.

Code-Switching

Bilingual corpora with real multilingual speakers. Norwegian-English, German-Turkish, French-Arabic, and more.

Evaluation Sets

Benchmark datasets for testing ASR model performance across dialects and accents. Phonetically balanced with demographic metadata.
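As a rough illustration of how an evaluation set with dialect metadata might be used, the sketch below computes word error rate (WER) per dialect. The row layout and the field names (`dialect`, `reference`, `hypothesis`) are assumptions made for this example, not the documented delivery schema, and the text normalization is deliberately minimal.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer_by_group(rows, key="dialect"):
    """Return {group: WER}, pooling errors and reference words per group."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        # Lowercasing and whitespace splitting stand in for real normalization
        ref = row["reference"].lower().split()
        hyp = row["hypothesis"].lower().split()
        errors[row[key]] += edit_distance(ref, hyp)
        words[row[key]] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}

# Hypothetical rows: reference transcript plus the ASR output under test
rows = [
    {"dialect": "Bergen", "reference": "jeg bor i bergen", "hypothesis": "jeg bor i bergen"},
    {"dialect": "Oslo", "reference": "jeg bor i oslo", "hypothesis": "jeg bor oslo"},
    {"dialect": "Oslo", "reference": "hvor er du", "hypothesis": "hvor er du"},
]
print(wer_by_group(rows))
```

Pooling errors and reference words per group, rather than averaging per-utterance WER, keeps short utterances from dominating the score.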

What Ships

Audio without metadata is unusable

You can't fine-tune on speakers you can't characterize. You can't balance training sets without demographic data. You can't satisfy compliance without consent documentation. Every YPAI dataset ships complete.

Audio Files (WAV / FLAC)

Configurable sample rate, lossless formats standard

Transcriptions (Verbatim)

Time-aligned, speaker-tagged, normalized variants

Speaker Metadata (Full Profile)

Age, gender, accent, dialect, location, device, environment

Consent Docs (Per Speaker)

Individual consent records with revocation support

Data Cards (EU AI Act)

Article 10 compliant documentation and bias analysis

QA Metrics (Audited)

Acceptance rate, inter-annotator agreement scores
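To make the point about metadata concrete, here is a minimal sketch of how per-speaker metadata and consent status could drive training-set preparation. The `Utterance` record and all of its field names are hypothetical stand-ins for the items listed above; the actual delivery schema may differ.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Utterance:
    # Hypothetical fields mirroring the metadata listed above
    audio_path: str
    duration_s: float
    speaker_id: str
    dialect: str
    gender: str
    consent_active: bool  # assumed to flip to False once consent is revoked

def usable(utterances):
    """Keep only utterances whose speaker consent is still active."""
    return [u for u in utterances if u.consent_active]

def hours_by(utterances, field):
    """Sum audio duration in hours per value of a metadata field."""
    totals = Counter()
    for u in utterances:
        totals[getattr(u, field)] += u.duration_s / 3600
    return dict(totals)

data = [
    Utterance("a.wav", 1800.0, "s1", "Bergen", "female", True),
    Utterance("b.wav", 3600.0, "s2", "Oslo", "male", True),
    Utterance("c.wav", 1800.0, "s3", "Oslo", "female", False),
]
print(hours_by(usable(data), "dialect"))
```

Running the same `hours_by` call on `gender` or any other field gives a quick view of where a training mix is skewed before any resampling is attempted.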

When to Use Each

Off-the-shelf datasets vs. custom collection

Pre-built corpora for speed. Custom collection for precision. Both with full provenance.

| | Off-the-Shelf | Custom Collection |
|---|---|---|
| Timeline | Days (immediate licensing) | Weeks to months |
| Cost | Lower (shared production cost) | Higher (dedicated project) |
| Customization | Fixed specifications | Fully tailored to requirements |
| Exclusivity | Non-exclusive license | Exclusive option available |
| Provenance | Full metadata and consent chain | Full metadata and consent chain |
| Best for | Prototyping, benchmarking, augmentation | Production training, specific requirements |

Many teams start with off-the-shelf datasets for prototyping and benchmarking, then commission custom collection for production training once requirements are validated. Both options include identical metadata and consent documentation.

Evaluation Process

From sample to production

Evaluate before you commit. Every dataset is available for sampling before licensing.

01 Sample

Request representative samples from our catalog. Specify language, dialect, environment, and vertical.

02 Review

Evaluate audio quality, transcription accuracy, and metadata completeness against your pipeline requirements.

03 License

Choose your licensing model. Non-exclusive for cost efficiency, exclusive for competitive advantage.

04 Integrate

Receive structured delivery with documented schema. Plug directly into your training pipeline.
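The review step can be partly automated. The sketch below checks metadata completeness on a sample manifest, assuming (purely for illustration) that the sample arrives as JSON Lines with one record per utterance and with keys named after the speaker metadata fields. Both the file format and the key names are assumptions, not the documented schema.

```python
import json

# Hypothetical required keys, named after the speaker metadata fields
REQUIRED = ("age", "gender", "accent", "dialect", "location", "device", "environment")

def completeness(records, required=REQUIRED):
    """Fraction of records with a non-empty value, per required field."""
    n = len(records)
    if n == 0:
        return {f: 0.0 for f in required}
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / n
        for f in required
    }

# Inline stand-in for a sample manifest file
sample_jsonl = """\
{"age": 34, "gender": "female", "accent": "native", "dialect": "Bergen", "location": "NO", "device": "headset", "environment": "office"}
{"age": 51, "gender": "male", "accent": "native", "dialect": "Oslo", "location": "NO", "environment": ""}
"""
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

report = completeness(records)
for field, share in report.items():
    print(f"{field}: {share:.0%}")
```

A field that falls below whatever threshold your pipeline needs is worth raising during sample review, before licensing.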

Get Started

Ready to evaluate samples?

We don't ask you to trust marketing claims. Request sample datasets with full metadata so you can evaluate fit before any commercial discussion.

Norwegian-headquartered. EEA data residency. 100,000+ hours cataloged.