PREMIUM AUDIO DATA

Voice data for the languages everyone else skips.

40,000 vetted freelancers across every EU language and the dialects that matter: Jutlandic Danish, Swiss German, Galician, Sámi, Frisian, and 90+ more. Audit-ready for the EU AI Act, GDPR-compliant by design.

See coverage matrix

The bottleneck isn’t model capability. It’s training data your speakers actually speak.

  • 24 EU official languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.
  • 90+ regional dialects + minority languages. Representative sample: Jutlandic Danish, Swiss German, Galician, Sámi (Northern + Inari + Skolt), Frisian (West + North + Saterland), Basque, Welsh, Irish Gaelic, Corsican, Sicilian, Andalusian, Catalan, Faroese, Sorbian, Kashubian. Full list available in scoping call.
  • 40,000 vetted freelancers: active speaker population as of Q1 2026. “Vetted” = ID-verified, audio-quality-tested, NDA-signed. Methodology shared during scoping.

COVERAGE MATRIX

The languages and dialects we cover.

Browse 8 representative coverage areas below. Open the full matrix to see all 24 EU languages and 90+ dialects, with vetted-speaker counts per language.

Norwegian Nynorsk

1,420 vetted

Western dialects, Inland

Concentrated in Vestlandet + Inland counties. ~38% female / 62% male active, ages 22–58. Common utterance length: 4–18s. Used to fix ASR bias against non-Bokmål Norwegian.

Jutlandic Danish

740 vetted

Western, Sønderjysk

Sønderjysk (South Jutlandic) has high regional variance. Speakers vetted across Aalborg, Aarhus, Esbjerg, and Tønder regions. Often missing from off-the-shelf Danish ASR.

Swiss German

985 vetted

Zürich, Bern, Basel variants

All three major Alemannic urban variants plus rural fallbacks. High demand from finance + insurance NLU teams. Most utterance prompts span finance, healthcare, and consumer domains.

Galician

520 vetted

Northern + Coastal

Co-official with Spanish in Galicia. Coastal vs interior phonology differs meaningfully. Used by clients building bilingual ES/GL voice products.

Sámi

210 vetted

Northern, Inari, Skolt

Three of the nine Sámi languages. Northern is most populous; Inari + Skolt are critically endangered. Recruited via community liaison networks across Finnmark + Inari + Sevettijärvi.

Frisian

285 vetted

West, North, Saterland

West Frisian (Netherlands), North Frisian (Schleswig-Holstein), Saterland Frisian (Lower Saxony). Used in regional government accessibility + cultural-heritage projects.

Welsh

640 vetted

Northern + Southern

North Welsh + South Welsh: the dialect boundary is meaningful for ASR. Common downstream uses: government services, BBC Cymru content tooling, education.

Corsican

155 vetted

Northern + Southern

Cismontano (north) and Oltramontano (south). Italo-Romance lineage. Among the smaller pools, typically scoped 3–6 weeks ahead for collection.

See all 24 EU languages + 90+ dialects (59 entries)
Language | Region / Country | Vetted speakers | Dialects covered | Sample audio
Germanic (13)
English | IE, MT (EU official) | 2,900 | Hiberno-English, Maltese English | scoping call
German | DE, AT, BE-DG, LU | 2,500 | Standard, Austrian, Low German, Swiss German (Zürich/Bern/Basel) | scoping call
Dutch | NL, BE-VL | 1,300 | Hollands, Flemish, Brabantian, Limburgish | scoping call
Danish | DK | 1,000 | Standard, Jutlandic (Western), Sønderjysk, Bornholmian | scoping call
Swedish | SE, FI | 1,200 | Standard, Skånsk, Gotlandic, Finland-Swedish | scoping call
Finnish | FI | 950 | Standard, Eastern, Western, Helsinki slang | scoping call
Norwegian (Bokmål) | NO (non-EU but EEA) | 1,850 | Eastern, Northern | scoping call
Norwegian (Nynorsk) | NO (non-EU but EEA) | 1,420 | Western, Inland | scoping call
Faroese | FO | 145 | Tórshavn, Suðuroy | scoping call
Frisian (West) | NL | 215 | Wood Frisian, Clay Frisian | scoping call
Frisian (North) | DE | 75 | Mooring, Fering | scoping call
Frisian (Saterland) | DE | 55 | Seelter | scoping call
Yiddish | EU-wide | 105 | Litvish, Galitzish | scoping call
Romance (15)
French | FR, BE, LU | 2,800 | Métropolitain, Belgian, Walloon, Acadian | scoping call
Italian | IT, MT | 2,350 | Standard, Sicilian, Neapolitan, Venetian, Sardinian, Friulian | scoping call
Spanish (Castilian) | ES | 2,700 | Castilian, Andalusian, Murcian, Canarian | scoping call
Catalan | ES, AD | 850 | Central, Valencian, Balearic | scoping call
Galician | ES | 520 | Northern, Coastal | scoping call
Portuguese | PT | 1,150 | European Portuguese, Azorean, Madeiran | scoping call
Romanian | RO | 1,100 | Standard, Moldavian, Transylvanian | scoping call
Asturian | ES | 220 | Central, Western, Eastern | scoping call
Sardinian | IT | 235 | Logudorese, Campidanese | scoping call
Corsican | FR | 155 | Cismontano, Oltramontano | scoping call
Sicilian | IT | 410 | Palermitan, Catanese | scoping call
Friulian | IT | 155 | Central, Western, Carnico | scoping call
Romansh | CH (non-EU) | 115 | Sursilvan, Vallader, Puter, Rumantsch Grischun | scoping call
Occitan | FR, IT, ES | 230 | Gascon, Languedocien, Provençal | scoping call
Walloon | BE | 105 | Central, Eastern | scoping call
Celtic (6)
Irish Gaelic | IE (EU official) | 410 | Connacht, Munster, Ulster | scoping call
Welsh | UK (non-EU) | 640 | Northern, Southern | scoping call
Scottish Gaelic | UK (non-EU) | 220 | Hebridean, Highland | scoping call
Manx | IM (non-EU) | 55 | Revival cohort | scoping call
Breton | FR | 225 | Kerneveg, Leoneg, Tregerieg, Gwenedeg | scoping call
Cornish | UK (non-EU) | 55 | Kernewek Kemmyn revival | scoping call
Slavic (11)
Polish | PL | 1,950 | Standard, Silesian, Kashubian-adjacent | scoping call
Czech | CZ | 1,150 | Standard, Moravian | scoping call
Slovak | SK | 850 | Standard, Eastern, Central | scoping call
Slovenian | SI | 620 | Standard, Prekmurje | scoping call
Bulgarian | BG | 1,100 | Eastern, Western | scoping call
Croatian | HR | 1,100 | Standard (Štokavian), Kajkavian, Chakavian | scoping call
Kashubian | PL | 145 | Northern, Southern | scoping call
Silesian | PL | 220 | Upper Silesian | scoping call
Sorbian (Upper) | DE | 95 | Bautzen / Budyšin | scoping call
Sorbian (Lower) | DE | 55 | Cottbus / Chóśebuz | scoping call
Rusyn | SK, PL, HU | 95 | Carpatho-Rusyn | scoping call
Uralic (9)
Hungarian | HU | 1,450 | Standard, Palóc, Csángó | scoping call
Estonian | EE | 850 | Standard, Võro, South Estonian | scoping call
Finnish (Karelian-adjacent) | FI | see Finnish row above | see Germanic-Finnish overlap | scoping call
Sámi (Northern) | NO, SE, FI | 145 | Davvisámegiella | scoping call
Sámi (Lule) | NO, SE | 55 | Julevsámegiella | scoping call
Sámi (Inari) | FI | 40 | Anarâškielâ | scoping call
Sámi (Skolt) | FI, RU border cohort | 40 | Nuõrttsääʹmǩiõll | scoping call
Karelian | FI, RU border cohort | 55 | Livvi-Karelian, Northern Karelian | scoping call
Võro | EE | 95 | South Estonian | scoping call
Baltic (2)
Latvian | LV | 620 | Standard, Latgalian | scoping call
Lithuanian | LT | 620 | Standard, Žemaitian, Aukštaitian | scoping call
Hellenic (1)
Greek | GR, CY | 1,100 | Standard, Cypriot Greek | scoping call
Other / Isolates (2)
Maltese | MT | 410 | Standard, regional | scoping call
Basque (Euskara) | ES, FR | 410 | Biscayan, Gipuzkoan, Upper Navarrese, Lapurdian | scoping call

Counts refreshed quarterly. Last refresh: 2026-Q1. Live capacity may vary by 5-10% based on freelancer availability per project window. Vetted = ID-verified, audio-quality-tested, NDA-signed.

Audio samples: native-speaker recordings from Wikimedia Lingua Libre and Wikitongues, licensed CC-BY-SA 4.0. Corsican sample is TTS-generated pending real recording.

How it works

Your specification, our pipeline, your data.

Four stages from brief to delivery. Every transition is logged for EU AI Act audit. Datasets ship with Article 10 documentation as standard.

01 Input

Brief & spec

You arrive with a use case. We arrive with a list of languages, dialects, accents, demographic distribution, recording conditions, and acceptance criteria. We converge in 1–2 calls.

  • 24 EU languages
  • 90+ dialects
  • Speaker demographics
  • Acoustic conditions
  • Acceptance criteria

02 Process

Pipeline, end to end

  1. Brief spec: specs locked (languages, dialects, demographics, acceptance criteria)
  2. Freelancer matching: pool of 40k filtered to qualified subset (native-speaker validation, dialect coverage, region)
  3. Record: studio-grade or remote-with-spec; quality validated per file (44.1 kHz / 16-bit min, SNR + clipping checks)
  4. Annotation: orthographic + phonetic + speaker meta + region (IPA transcription, speaker ID, accent label)
  5. QA + Audit: Cohen’s Kappa ≥0.85 inter-annotator agreement (EU AI Act Article 10 doc auto-generated)

Note on the Kappa threshold: 0.85 is the industry-standard threshold for premium audio data. Measured continuously, reported in the QA dashboard. Methodology shared in the scoping call.
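
For readers who want to see what the 0.85 figure measures, here is a minimal sketch of Cohen's Kappa for two annotators. It uses the textbook definition (observed agreement corrected for chance agreement); the function name and the accent labels are illustrative assumptions, not ypai's actual QA code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical accent labels from two annotators on ten clips.
annotator_1 = ["west", "west", "inland", "west", "inland",
               "west", "inland", "inland", "west", "west"]
annotator_2 = ["west", "west", "inland", "west", "inland",
               "west", "inland", "west", "west", "west"]

kappa = cohens_kappa(annotator_1, annotator_2)
passes = kappa >= 0.85  # threshold quoted in the pipeline description
print(f"kappa={kappa:.3f} passes={passes}")
```

In this made-up batch the annotators agree on nine of ten clips, yet Kappa lands near 0.78 because much of that agreement is expected by chance, so the batch would fall short of the 0.85 bar.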

manifest.json | json
{
  "dataset_id": "ypai-NO-nyn-2026Q2",
  "language": "nno",
  "dialect": "nynorsk-western",
  "speakers": 142,
  "hours": 218.7,
  "kappa": 0.87,
  "wer_baseline": 9.4,
  "sample_rate_hz": 48000,
  "bit_depth": 16,
  "eu_ai_act_doc": "annex/article-10.pdf",
  "license": "ypai-commercial-perpetual",
  "delivery": "2026-04-28T14:30:00Z"
}
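
A receiving team could sanity-check a manifest like this against the minimums quoted in the pipeline (44.1 kHz, 16-bit, Kappa of at least 0.85). The sketch below assumes the key names shown in the sample manifest; they are not a confirmed schema, and `check_manifest` is a hypothetical helper.

```python
# Minimums stated in the pipeline description. Key names follow the sample
# manifest and are an assumption, not a confirmed schema.
MIN_SAMPLE_RATE_HZ = 44_100
MIN_BIT_DEPTH = 16
MIN_KAPPA = 0.85


def check_manifest(manifest):
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    rate = manifest.get("sample_rate_hz")
    if rate is None or rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample_rate_hz {rate} below {MIN_SAMPLE_RATE_HZ}")
    depth = manifest.get("bit_depth")
    if depth is None or depth < MIN_BIT_DEPTH:
        problems.append(f"bit_depth {depth} below {MIN_BIT_DEPTH}")
    kappa = manifest.get("kappa")
    if kappa is None or kappa < MIN_KAPPA:
        problems.append(f"kappa {kappa} below {MIN_KAPPA}")
    return problems


# Subset of the sample manifest fields that the check reads.
sample = {
    "dataset_id": "ypai-NO-nyn-2026Q2",
    "sample_rate_hz": 48000,
    "bit_depth": 16,
    "kappa": 0.87,
}
problems = check_manifest(sample)
print(problems or "manifest meets the stated minimums")
```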

03 Output

Dataset & docs

You receive a versioned dataset bundle plus an Article 10 compliance manifest, a QA report, and per-file metadata. Everything needed to ship to procurement, legal, and your ML team at the same time.

  • Versioned dataset
  • Article 10 manifest
  • QA report (Kappa, WER)
  • Per-file metadata
  • Speaker-consent ledger
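
The QA report lists WER alongside Kappa. As a reference point, word error rate is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below implements that standard definition; the example sentences are invented, and how ypai computes its reported WER is not stated on this page.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a Nynorsk reference transcribed with Bokmål spellings.
wer = word_error_rate("eg kjem frå vestlandet", "jeg kommer fra vestlandet")
print(f"WER = {wer:.2f}")  # three substitutions over four words: 0.75
```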

PROOF

  • “We needed Nynorsk and Jutlandic Danish at production volume in 8 weeks. ypai delivered both with audit-ready documentation. Our WER on Western Nordic dialects dropped 12.4% after retraining.”

    Head of Speech AI

    Tier-1 European ASR provider

    300+ million end-users · EU-headquartered

  • “We replaced 19% of agent-routed calls with self-service after retraining on ypai’s regional dialect data. Procurement signed off because the audit trail mapped directly to our DPIA. Six markets, six languages, one contract.”

    VP, Conversational AI

    European telecommunications group

    12 countries · 80+ million subscribers

  • “Statutory minority-language access used to mean a backlog of manually-transcribed citizen submissions. ypai’s Sámi and Frisian coverage let us automate intake without dropping accuracy below our 92% acceptance threshold. Audit-ready by default mattered as much as the coverage.”

    Director of Digital Services

    Northern European public-sector authority

    5.4 million citizens served

Logos shown by category, plus 200+ other EU enterprises. Specific references available under NDA in scoping call.

COMPLIANCE

EU AI Act effective Aug 2026 · Our datasets are audit-ready today.

Delivery API

Five lines to your first dataset.

REST + signed-URL delivery. Datasets stream to your S3, GCS, Azure, or HTTPS endpoint of choice. EU data residency enforced per contract.

REST + signed URLs

Every dataset is fetchable with a 15-min signed URL; stream straight to your S3, GCS, Azure, or HTTPS endpoint.

EU data residency

Frankfurt or Stockholm hosting, enforced per contract. SCC + DPA available. Speaker-consent ledger included.

Audit endpoint

GET /v1/datasets/{id}/audit returns the EU AI Act Article 10 documentation as JSON or PDF.
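
A call to the audit endpoint might look like the sketch below. Only the path `/v1/datasets/{id}/audit` comes from this page; the host (`api.example.com`), the bearer-token header, and the use of the `Accept` header to choose JSON or PDF are all assumptions made for illustration.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical host: the page names only the path, not the base URL.
BASE_URL = "https://api.example.com"


def build_audit_request(dataset_id, token, as_pdf=False):
    """Build (but do not send) a GET request for a dataset's audit document."""
    url = f"{BASE_URL}/v1/datasets/{quote(dataset_id, safe='')}/audit"
    headers = {
        # Assumed auth scheme; check the real API documentation.
        "Authorization": f"Bearer {token}",
        # Assumed content negotiation between the JSON and PDF variants.
        "Accept": "application/pdf" if as_pdf else "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_audit_request("ypai-NO-nyn-2026Q2", token="YOUR_TOKEN")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30)
```

Building the request separately from sending it keeps the example runnable offline and makes the assumed headers easy to swap once the real scheme is known.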

Scope your dataset in 30 minutes.

No commitment. Response within 1 business day.