100,000+ Hours. 50+ Dialects. One Catalog.

Browse production-ready European audio by language, environment, vertical, or compliance requirement. Every dataset includes speaker demographics, device metadata, acoustic conditions, and full consent provenance.

Request Sample Datasets Download Catalog PDF

Production-ready European audio data

100,000+ hours cataloged

50+ European dialects

20,000+ verified contributors

Full demographic metadata

GDPR-native consent

EU AI Act documentation

Find What Your Model Needs

Most audio vendors give you a spreadsheet. We give you a catalog built for how ML engineers actually evaluate training data.

Every dataset in this catalog is indexed across the dimensions that determine whether audio ships to production or fails in deployment.

Filter by language architecture

Not just "German" but Swiss German (Zürich), Swiss German (Bern), Austrian German (Vienna), Bavarian, Swabian. Not just "English" but Glaswegian, Scouse, Belfast, Dublin. Select the specific variant your users actually speak.

Filter by acoustic conditions

Studio reference. Office ambient. Automotive highway at 120 km/h. Factory floor with heavy machinery. Hospital ward. Call center crosstalk. Warehouse logistics. Maritime bridge. Each environment is tagged with SNR ranges and noise classification.

Filter by elicitation type

Scripted read speech for baseline phoneme coverage. Spontaneous conversation for real-world variation. Command and control for voice interface training. Emotional expression for prosodic models. Specify what your architecture requires.

Filter by speaker demographics

Age bands. Gender distribution. Native vs. non-native. Regional origin. Education level. Occupation category. Every speaker profile is documented. Build balanced datasets or target specific populations.

Filter by device characteristics

Professional studio microphone. Smartphone (iOS/Android, model-specific). Laptop built-in. Headset. Far-field array. In-vehicle microphone array. Know exactly what hardware captured your training data.

Filter by compliance requirements

GDPR consent with revocation support. Biometric-safe processing under GDPR Article 9. EU AI Act Article 10 documentation. Full provenance chain. De-identified variants available. Select the governance level your legal team requires.
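As a rough illustration, the filter dimensions above could be combined in a single query along these lines. This is a minimal sketch only: the `DatasetEntry` fields, the `filter_catalog` helper, the tag names, and the sample values are hypothetical and do not reflect the actual catalog schema or any published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    # Hypothetical catalog record; field names are illustrative assumptions.
    name: str
    language_variant: str
    environment: str
    elicitation: str
    device: str
    compliance: frozenset


def filter_catalog(entries, **criteria):
    """Return entries that match every criterion.

    String fields must match exactly; `compliance` must contain all requested tags.
    """
    required_tags = frozenset(criteria.pop("compliance", ()))
    matches = []
    for entry in entries:
        if not required_tags <= entry.compliance:
            continue
        if all(getattr(entry, field) == value for field, value in criteria.items()):
            matches.append(entry)
    return matches


# Made-up sample values, for illustration only.
catalog = [
    DatasetEntry("Swiss German Zürich Conversational", "Swiss German (Zürich)",
                 "office", "spontaneous", "smartphone",
                 frozenset({"gdpr_consent", "ai_act_art10"})),
    DatasetEntry("Automotive Highway European", "German (Standard)",
                 "automotive_highway", "command", "in_vehicle_array",
                 frozenset({"gdpr_consent"})),
]

hits = filter_catalog(catalog, language_variant="Swiss German (Zürich)",
                      compliance={"gdpr_consent", "ai_act_art10"})
print([entry.name for entry in hits])
```

Treating compliance as a set of required tags, rather than a single level, lets a legal team ask for several guarantees at once.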

Eight Dataset Families

1. Dialects and Regional Speech

The problem:

Your model trained on standard German hits 8% WER on broadcast news and 35% WER on a customer from Zürich. Training data recorded in Berlin doesn't transfer to Bavaria. British English benchmarks collapse in Glasgow.
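For reference, WER figures like these are conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below shows that standard calculation; the example sentences are made up, and real scoring pipelines usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word and one substituted word over five reference words.
print(word_error_rate("turn the heating up please", "turn heating up piece"))
```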

Why it matters:

Dialect variation isn't an edge case. For pan-European deployment, regional speech is the majority of your production traffic. Models trained on standard accents systematically exclude large user populations.

What we provide:

Native speakers recorded in their home regions, with documented geographic origin, dialect classification, and sub-regional variation. Deep vertical coverage, not thin horizontal breadth.

Swiss German Zürich Conversational

2,400 hours of spontaneous dialogue from Zürich canton. Urban and suburban speakers. Alemannic dialect features documented.

Swiss German Bern Regional

1,800 hours covering Bernese Oberland, Emmental, and Seeland sub-dialects. Rural and small-town speakers.

Bavarian Multi-Regional

3,200 hours across Upper Bavaria, Lower Bavaria, and Upper Palatinate. Munich urban contrasted with rural Altbayern.

Austrian German Vienna

2,100 hours of Viennese German. Service industry, professional, and casual registers.

Glaswegian Conversational

1,400 hours of working-class and professional Glasgow speech. Central Belt variation.

Scouse Liverpool Urban

900 hours of Merseyside English. Multi-generational coverage.

Andalusian Spanish Multi-City

2,800 hours covering Seville, Málaga, Granada, and Cádiz. Seseo, ceceo, and aspirated /s/ variants documented.

Catalan Barcelona Regional

1,600 hours of Central Catalan. Native and bilingual speakers with code-switching to Spanish.

2. Code-Switching and Multilingual Speech

The problem:

Real European speakers switch languages mid-sentence. A Frankfurt banker mixes English and German. A Brussels professional alternates French and Dutch. Your monolingual training data can't follow the conversation.

Why it matters:

Code-switching is the default in European business, tech, and urban contexts. Models trained on monolingual corpora fail on real-world multilingual users. Intent recognition breaks when the language changes.

What we provide:

Authentic bilingual speakers producing natural code-switching. Annotated language boundaries. Both intra-sentential switching (mid-sentence) and inter-sentential switching (sentence-level alternation).
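To make the two switching types concrete, here is a minimal sketch that counts them from token-level language tags. The input layout (a list of sentences, each a list of `(token, language)` pairs), the `count_switches` helper, and the example utterance are assumptions for illustration, not the annotation format shipped with the datasets.

```python
def count_switches(sentences):
    """Count language switches inside sentences and across sentence boundaries.

    `sentences` is a list of sentences; each sentence is a list of
    (token, language_code) pairs.
    """
    intra = 0
    inter = 0
    previous_last_lang = None
    for sentence in sentences:
        if not sentence:
            continue
        langs = [lang for _, lang in sentence]
        # Intra-sentential: the language changes between adjacent tokens.
        intra += sum(1 for a, b in zip(langs, langs[1:]) if a != b)
        # Inter-sentential: this sentence starts in a different language
        # than the one the previous sentence ended in.
        if previous_last_lang is not None and langs[0] != previous_last_lang:
            inter += 1
        previous_last_lang = langs[-1]
    return {"intra_sentential": intra, "inter_sentential": inter}


# Made-up example utterance with hand-assigned language tags.
utterance = [
    [("Wir", "de"), ("müssen", "de"), ("das", "de"), ("forecast", "en"), ("updaten", "de")],
    [("Let's", "en"), ("discuss", "en"), ("tomorrow", "en")],
]
print(count_switches(utterance))
```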

English-German Frankfurt Business

1,800 hours. Finance, consulting, and tech contexts. Intra-sentential switching dominant.

English-German Berlin Tech

1,200 hours. Startup and software development contexts. High English lexical borrowing.

English-French Brussels Professional

1,400 hours. EU institutional, legal, and business contexts.

English-French Paris Urban

1,100 hours. Service industry and professional contexts. North African French influence.

French-Arabic Paris Marseille

1,600 hours. First and second-generation speakers. Maghrebi Arabic features.

German-Turkish Berlin Cologne

1,300 hours. Second and third-generation speakers. Kiezdeutsch features documented.

English-Spanish Barcelona Miami

900 hours. Catalan-influenced Spanish with English switching.

Swedish-Finnish Helsinki Bilingual

700 hours. Finland-Swedish speakers with Finnish code-switching.

3. Noisy and Real-World Environments

The problem:

Your ASR works perfectly in the lab. Then it enters production. Highway wind noise. Factory machinery. Hospital alarms. Production WER ends up at 2.8x to 5.7x the benchmark WER.

Why it matters:

Studio recordings don't prepare models for deployment. Clean speech corpora create a domain mismatch that no amount of fine-tuning on clean data will fix. You need training data from the acoustic environments where your model will actually run.

What we provide:

Speech recorded in real operational environments with calibrated noise levels. SNR metadata. Noise type classification. Device and microphone position documentation.
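For readers who want to sanity-check SNR metadata, the standard definition is ten times the base-10 logarithm of the ratio of speech power to noise power. The sketch below assumes a speech segment and a noise-only segment are available as separate lists of float samples. How the catalog's SNR values are actually estimated is not stated here, so treat the helper names and the method as illustrative.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf
    if p_speech == 0:
        return -math.inf
    return 10 * math.log10(p_speech / p_noise)


# Synthetic example: noise amplitude is one tenth of the speech amplitude,
# so the power ratio is 100 and the SNR is about 20 dB.
speech = [0.5, -0.5] * 100
noise = [0.05, -0.05] * 100
print(round(snr_db(speech, noise), 2))
```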

Automotive Highway European

4,200 hours across 18 vehicle models. 80-130 km/h conditions. HVAC on/off. Windows up/down. Driver and passenger positions.

Automotive Urban Stop-Start

2,800 hours. City traffic conditions. Engine idle. Intersection stops. Turn signal noise.

Factory Floor Manufacturing

1,600 hours. CNC machinery. Conveyor systems. Forklift traffic. PPE-muffled speech (masks, ear protection).

Warehouse Logistics

1,200 hours. Pallet handling. Forklift operations. Scanner beeps. Ambient ventilation.

Hospital Ward Ambient

1,400 hours. Medical alarms. Paging systems. Multi-speaker clinical environments. Patient room and corridor acoustics.

Call Center Crosstalk

2,200 hours. Adjacent agent bleed. Headset audio. 8 kHz telephony audio with codec compression. Hold music background.

Maritime Bridge Operations

600 hours. Engine room proximity. Radio chatter. Weather exposure. Norwegian and English mixed commands.

Offshore Platform Industrial

500 hours. Machinery noise. Wind exposure. Safety equipment environments. Norwegian-English code-switching.

4. Vertical-Specific Datasets

The problem:

Your general-purpose ASR doesn't know that "CABG" means coronary artery bypass graft. It transcribes "Brent crude" as "brand crude." Domain vocabulary isn't optional. It's the difference between 5% WER and 25% WER.

Why it matters:

Every vertical has specialized terminology that general speech models never encountered in training. Medical, financial, legal, and technical domains require purpose-built corpora.

What we provide:

Industry-specific vocabulary coverage. Domain expert speakers. Realistic operational contexts. Terminology validation by subject matter experts.

Automotive Voice Command European

3,400 hours. Navigation, media control, climate, and communication commands. 15 languages. Wake word and barge-in scenarios.

Clinical Dictation German

2,100 hours across 28 specialties. Board-certified physician speakers. Cardiology, radiology, pathology, emergency medicine emphasis.

Clinical Dictation French

1,800 hours. 22 specialties. Parisian and regional accents. Inpatient and outpatient contexts.

Financial Trading German English

800 hours. Trading floor recordings. FX, equities, and fixed income terminology. Multi-speaker crosstalk.

Legal Dictation German

1,100 hours. Contract law, corporate law, litigation. Formal and informal registers.

Energy Sector Norwegian English

700 hours. Oil and gas operations. Offshore and onshore contexts. Technical terminology.

Maritime Operations Nordic

500 hours. Bridge commands. Port communications. Safety procedures. Norwegian, Swedish, Danish, English.

5. Clinical and Pathology Speech

The problem:

Speech biomarker research requires audio from clinical populations. Parkinson's tremor. Post-stroke dysarthria. Cognitive decline markers. Healthy control datasets don't capture pathological speech patterns.

Why it matters:

Healthcare AI needs training data from real patient populations, collected with appropriate consent and privacy protections. Clinical audio requires specialized collection protocols and ethical oversight.

What we provide:

Patient speech collected under clinical research protocols. IRB-equivalent ethics approval. Longitudinal tracking capability. Condition-specific recruitment.

Parkinson's Disease German

180 hours. Early and mid-stage patients. Medication on/off states. Tremor and rigidity markers.

Post-Stroke Dysarthria European

220 hours. Aphasia types documented. Recovery progression. Six languages.

Mild Cognitive Impairment Nordic

160 hours. Memory clinic patients. Age-matched healthy controls. Longitudinal samples.

Depression Screening German

140 hours. PHQ-9 validated severity levels. Prosodic and lexical markers.

Respiratory Condition Markers

120 hours. Asthma, COPD, post-COVID. Breathing patterns and voice quality changes.

6. Emotional and Prosodic Speech

The problem:

Your TTS sounds robotic because it was trained on neutral read speech. Your sentiment model can't distinguish anger from frustration. Prosodic variation requires purpose-built training data.

Why it matters:

Next-generation voice AI requires emotional range. Character voices. Dynamic dialogue. Customer sentiment detection. Neutral corpora can't teach these patterns.

What we provide:

Acted and spontaneous emotional speech. Valence and arousal annotations. Prosodic contour documentation. Voice actor and natural speaker variants.

Acted Emotion German Full Range

800 hours. Professional voice actors. Six primary emotions plus blends. High and low intensity variants.

Acted Emotion English (UK) Full Range

900 hours. Regional actors. RP and regional accent variants. Character archetypes.

Spontaneous Emotion Call Center

1,400 hours. Real customer interactions (consent-obtained). Frustration, satisfaction, confusion, urgency labeled.

Whispered Speech European

300 hours. Five languages. ASMR-adjacent and privacy-context whispers.

Shouted and Projected Speech

400 hours. Sports context. Emergency context. Crowd noise overlay variants.

Gaming Character Archetypes

600 hours. Fantasy, sci-fi, historical character types. European voice actors.

7. Minority and Low-Resource Languages

The problem:

The Sami languages have roughly 30,000 speakers. Basque has about 750,000. Frisian has about 500,000. These communities deserve voice AI that works, but commercial providers ignore them. Your European deployment isn't complete without regional and minority coverage.

Why it matters:

EU accessibility requirements increasingly cover minority language support. Public sector deployments in Norway, Finland, and Spain require regional language capability. Low-resource languages need purpose-built collection, not thin crowd-sourced samples.

What we provide:

Community-partnered collection. Cultural and linguistic consultation. Dialect documentation. Orthographic and phonetic transcription.

Northern Sami Norway Finland

120 hours. Native speakers from Kautokeino, Karasjok, and Finnish Lapland. Read and spontaneous speech.

Basque Euskara Regional

340 hours. Gipuzkoan, Bizkaian, and standard Batua variants. Urban and rural speakers.

Catalan Full Regional

1,200 hours. Central, Valencian, Balearic, and Northwestern variants.

Welsh North and South

280 hours. Gwynedd and Carmarthenshire variants. First-language and learner speakers.

Breton Brittany Regional

140 hours. Elderly native speakers. Revitalization context learners.

Frisian West Frisian

160 hours. Netherlands province speakers. Dutch code-switching documented.

Faroese Iceland Comparison

90 hours. Faroese primary with Icelandic mutual intelligibility pairs.

8. Synthetic-Safe Grounding Datasets

The problem:

You're fine-tuning Whisper or training a custom ASR, and you need clean, legally unambiguous, consent-verified audio. Public datasets have unclear provenance. Your legal team wants to know exactly where every hour came from.

Why it matters:

EU AI Act Article 10 sets data governance requirements for high-risk AI systems, including documenting the origin of training data. Undocumented data creates regulatory exposure. Foundation model training requires bulletproof consent chains.

What we provide:

Studio-quality reference recordings with unambiguous consent. Zero synthetic contamination. Full speaker demographics. Explicit commercial licensing.

Nordic Reference Corpus Clean

2,400 hours. Norwegian, Swedish, Danish, Finnish, Icelandic. Studio conditions. CC-BY-SA licensing.

German Reference Multi-Accent Clean

3,200 hours. Standard German with Austrian, Swiss, and regional variants. Studio conditions.

European Phoneme Coverage Balanced

1,800 hours. Twelve languages. Phonetically balanced sentence sets. IPA alignment.

Demographic Balanced European

4,200 hours. Age, gender, and regional quotas across ten countries. Bias evaluation documentation included.

High-Value Collections

Swiss German Complete Regional

6,200 hours across Zürich, Bern, Basel, Lucerne, and St. Gallen cantons. The most comprehensive Alemannic German corpus available for commercial licensing. Includes spontaneous conversation, read speech, and command-and-control scenarios. Sub-dialect classification at municipality level. Urban/rural speaker distribution documented. Recorded 2022-2024 on smartphones and professional equipment.

Why it matters: Swiss German dialects are largely unintelligible to Standard German speakers without prior exposure. Models trained on Standard German fail systematically on Swiss users. This corpus closes the gap.

Nordic Languages Bundle

14,000 hours across Norwegian (Bokmål, Nynorsk, five dialect regions), Swedish (four dialect regions including Finland-Swedish), Danish (Copenhagen and Jutlandic), Finnish, and Icelandic. Scripted and unscripted variants. Full demographic metadata across age, gender, and regional origin.

Why it matters: No competitor offers comparable Nordic depth. Speechmatics lists these languages but doesn't publish dialect-specific coverage. This bundle covers Nordic deployment end-to-end.

European Automotive In-Cabin

8,400 hours across 22 vehicle models from eight manufacturers. Highway (100-140 km/h), urban, and idle conditions. HVAC states documented. Driver and passenger positions. 18 languages with native-accent speakers. Infotainment commands, navigation requests, and spontaneous conversation.

Why it matters: In-cabin acoustic conditions cannot be synthesized. Augmenting studio recordings with noise overlays doesn't replicate real vehicle transfer functions. This corpus captures ground-truth in-vehicle speech.

Clinical Dictation European

6,800 hours across German, French, Spanish, Italian, and Dutch. 35+ medical specialties. Board-certified and practicing physicians. Inpatient and outpatient contexts. HIPAA-equivalent consent protocols. De-identified variants available.

Why it matters: Medical terminology causes 3-5x WER degradation versus general speech. Specialty-specific vocabulary (cardiology vs. radiology vs. pathology) requires purpose-built corpora. This dataset covers the clinical documentation use case at scale.

British Isles Complete Accent Collection

8,200 hours covering Glaswegian, Scouse, Geordie, Belfast, Dublin, Cork, Welsh English, West Country, Yorkshire, and Birmingham. Native speakers recorded in home environments. Multi-generational samples. Spontaneous conversation and elicited speech.

Why it matters: British English isn't one accent. Models trained on RP or general British data fail on regional speakers. Customer service, healthcare, and public sector applications require regional coverage.

European Call Center Multilingual

12,400 hours across 22 languages. Real call center recordings (consent-obtained). Customer service, technical support, complaints, and sales contexts. Emotional state annotations. PII fully redacted. 8 kHz telephony and VoIP quality variants.

Why it matters: Call center audio has unique acoustic characteristics: narrow bandwidth, compression artifacts, headset coloration, crosstalk. Models trained on wideband audio degrade on telephony. This corpus matches production conditions.

Code-Switching European Business

6,800 hours of bilingual speech across eight language pairs. English-German (Frankfurt, Berlin), English-French (Paris, Brussels), French-Arabic (Paris, Marseille), German-Turkish (Berlin, Cologne), and others. Intra-sentential and inter-sentential switching annotated.

Why it matters: Monolingual models fail on multilingual users. Code-switching is standard in European business, tech, and urban contexts. This corpus enables real-world multilingual ASR.

Emotional Speech European Acted

3,200 hours of professional voice actor recordings. Six primary emotions at three intensity levels. German, English (UK), French, Spanish, and Italian. Valence and arousal annotations. Prosodic contour documentation.

Why it matters: TTS systems require emotional range. Sentiment analysis requires labeled emotional speech. Neutral corpora can't teach these patterns. Professional acted speech provides ground-truth emotional expression.

What Ships With Every Dataset

Audio without metadata is unusable. You can't fine-tune on speakers you can't characterize. You can't balance training sets without demographic data. You can't satisfy compliance requirements without consent documentation.

Every YPAI dataset includes structured metadata at the recording, speaker, and collection-session levels.

Recording-level metadata

Duration (milliseconds)
Sample rate (typically 16 kHz or 48 kHz)
Bit depth
File format
Recording date
Device type and model
Microphone type
Acoustic environment classification
Signal-to-noise ratio estimate
Clipping detection flag
Silence ratio
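Two of the recording-level fields above, the clipping detection flag and the silence ratio, can be derived directly from the samples. The sketch below shows one plausible way to do that on float samples normalized to [-1.0, 1.0]. The function names, the run length, the frame size, and the -40 dBFS threshold are assumptions chosen for illustration; the definitions used to populate the catalog metadata may differ.

```python
import math


def clipping_flag(samples, full_scale=1.0, min_run=3):
    """True if at least `min_run` consecutive samples reach full scale."""
    run = 0
    for s in samples:
        if abs(s) >= full_scale:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


def silence_ratio(samples, frame_len=160, threshold_db=-40.0):
    """Fraction of frames whose RMS level is below `threshold_db` (dBFS)."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    silent = 0
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        level_db = 20 * math.log10(rms) if rms > 0 else -math.inf
        if level_db < threshold_db:
            silent += 1
    return silent / len(frames)


# Synthetic signal: two silent frames followed by one loud frame.
signal = [0.0] * 320 + [0.5, -0.5] * 80
print(silence_ratio(signal))
print(clipping_flag([0.2, 1.0, 1.0, 1.0, 0.1]))
```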

Speaker-level metadata

Unique speaker ID (pseudonymized)
Age band
Gender
Native language
Dialect/accent classification
Geographic origin (country, region, city where applicable)
Education level
Occupation category
Self-reported language proficiency (CEFR scale for non-native)
Years of residence in recording location

Session-level metadata

Collection date
Collection method (app, studio, field recording)
Recording environment description
Noise classification
Device positioning
Elicitation type (scripted, spontaneous, command, emotional)

Consent and provenance

Consent ID linking to master consent record
Consent version (for updated consent forms)
Consent scope (what the audio may be used for)
Revocation status (propagated from master consent system)
Collection organization
Collection protocol reference
Annotation lineage (who annotated what, when, using which guidelines)
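In practice, these consent fields let a training pipeline drop recordings whose consent has been revoked or whose scope does not cover the intended use. The sketch below assumes a simple in-memory mapping from consent ID to record. The `ConsentRecord` layout, the `usable_recordings` helper, and the scope labels are hypothetical and are not the delivered metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    # Hypothetical layout mirroring the consent fields listed above.
    consent_id: str
    version: str
    scope: frozenset
    revoked: bool


def usable_recordings(recordings, consents, intended_use):
    """Return recording IDs whose consent is active and covers `intended_use`.

    `recordings` is an iterable of (recording_id, consent_id) pairs;
    `consents` maps consent_id to ConsentRecord.
    """
    usable = []
    for recording_id, consent_id in recordings:
        consent = consents.get(consent_id)
        if consent is None or consent.revoked:
            continue  # no consent on file, or consent withdrawn
        if intended_use in consent.scope:
            usable.append(recording_id)
    return usable


# Made-up records for illustration.
consents = {
    "c1": ConsentRecord("c1", "v2", frozenset({"asr_training", "tts_training"}), False),
    "c2": ConsentRecord("c2", "v1", frozenset({"asr_training"}), True),
    "c3": ConsentRecord("c3", "v2", frozenset({"research_only"}), False),
}
recordings = [("rec-001", "c1"), ("rec-002", "c2"), ("rec-003", "c3"), ("rec-004", "c9")]
print(usable_recordings(recordings, consents, "asr_training"))
```

Recordings with no matching consent record are excluded by default, which is the conservative choice when provenance is incomplete.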

EU AI Act Article 10 documentation

Demographic distribution analysis
Geographic representation analysis
Known limitations and gaps
Data quality measures
Bias evaluation methodology
Training data sheet in standardized format
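A demographic distribution analysis of this kind boils down to computing each group's share of speakers and flagging groups that fall below a target. The sketch below is one minimal way to do that from speaker-level metadata. The dictionary keys, the `distribution` and `underrepresented` helpers, the age bands, and the 20% threshold are illustrative assumptions, not the methodology used in the shipped documentation.

```python
from collections import Counter


def distribution(speakers, key):
    """Share of speakers per value of `key` (for example, age band)."""
    counts = Counter(speaker[key] for speaker in speakers)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(shares, expected_values, min_share):
    """Expected groups whose share is below `min_share`, including absent ones."""
    return sorted(v for v in expected_values if shares.get(v, 0.0) < min_share)


# Made-up speaker metadata for illustration.
speakers = [
    {"speaker_id": "s1", "age_band": "18-29", "gender": "female"},
    {"speaker_id": "s2", "age_band": "18-29", "gender": "male"},
    {"speaker_id": "s3", "age_band": "30-49", "gender": "female"},
    {"speaker_id": "s4", "age_band": "50-69", "gender": "male"},
]
shares = distribution(speakers, "age_band")
print(shares)
print(underrepresented(shares, ["18-29", "30-49", "50-69", "70+"], min_share=0.2))
```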

Industry Solutions

Browse by Industry

Automotive Voice AI

In-cabin recordings. Highway, urban, and idle conditions. 70+ language and dialect combinations. Wake word and command datasets. Navigation, media, climate, and communication scenarios.

70+ language/dialect combinations

22 vehicle models

18 languages with native speakers

Healthcare and Clinical Speech

Physician dictation across 35+ specialties. Ambient clinical conversation. Medical terminology in European languages. De-identified variants. HIPAA-equivalent consent.

35+ medical specialties

6,800+ hours available

De-identified variants

Finance and Call Center

Trading floor recordings. Call center customer service. Financial terminology. PCI-DSS pre-scrubbed variants. Emotional state annotations.

12,400 hours call center

22 languages

Emotional annotations

Industrial and Manufacturing

Factory floor speech. Warehouse logistics. Maritime operations. Offshore platforms. PPE-muffled speech. Industrial noise environments.

Real operational environments

SNR metadata

Safety equipment contexts

Gaming and Entertainment

Emotional speech for TTS. Character archetypes. Voice actor recordings. Prosodic variation. Whispered and projected speech.

3,200 hours emotional speech

Character archetypes

European voice actors

Broadcasting and Media

Multi-speaker panel discussions. Sports commentary. News broadcast. Proper noun emphasis. Live captioning training data.

Multi-speaker scenarios

Sports and news

European broadcasting

Ready to Build Better Voice AI?

Get Started Today

We don't ask you to trust marketing claims. We ask you to evaluate samples. Request sample datasets and specify your language, dialect, environment, and vertical. We'll send representative samples with full metadata so you can evaluate fit before any commercial discussion.

Norwegian-headquartered. EEA data residency. 100,000+ hours cataloged. Audio data your models can trust.