Data Collection

Multi-modal training data for regulated AI

Audio, image, video, LiDAR, text, TTS, transcription, and parallel corpus. Per-contributor GDPR consent. EEA-resident contributor network. EU AI Act Article 10 documentation shipped with every project.

EEA-only · GDPR Art. 7 + 9 · EU AI Act Art. 10 · 30-day erasure

Norwegian legal entity. Reply within one business day.

Modality Coverage

Eight modalities under one master DPA

Most vendors split data collection across separate service lines, each with its own legal instrument. YPAI runs audio, image, video, LiDAR, text, TTS, transcription, and parallel corpus under one master DPA, one consent layer, and one delivery contract.

Modality: Audio
  Volume capability: 150+ languages, native-speaker capture
  Specialisation: Wake-word, in-cabin DMS, multilingual ASR, dialect coverage
  Output format: WAV / FLAC 48 kHz / 24-bit + JSON sidecar

Modality: Image
  Volume capability: GDPR-faces consent on demand
  Specialisation: Segmentation-ready, demographic metadata per contributor
  Output format: 12-bit RAW + COCO / YOLO / Pascal JSON
  Use-case anchors: Automotive perception, Retail computer vision, Healthcare imaging

Modality: Video
  Volume capability: 25,000+ files handled, multi-camera sync
  Specialisation: Proprietary collection platform, automated ingestion and QC
  Output format: MP4 / MOV + per-frame metadata
  Use-case anchors: Automotive ADAS, Sports analytics, Security

Modality: LiDAR
  Volume capability: Sensor-fusion grade, per-project
  Specialisation: Time-aligned LiDAR + radar + camera, ASIL-aware taxonomy
  Output format: PCD / LAS + sensor-fusion sidecar
  Use-case anchors: Automotive perception, Robotics, Surveying

Modality: Text
  Volume capability: Domain-specialised, regulated lexicons
  Specialisation: Legal, medical, financial corpora with provenance manifest
  Output format: JSONL + provenance manifest
  Use-case anchors: LLM fine-tuning, Domain RAG, Regulated NLP

Modality: TTS
  Volume capability: Identity-verified, studio-grade
  Specialisation: Speaker diversity, voice-cloning consent layer (Article 9)
  Output format: WAV studio-grade + speaker metadata
  Use-case anchors: Voice assistants, Localised TTS, Accessibility

Modality: Transcription
  Volume capability: 150+ languages, phoneme option
  Specialisation: Speaker diarisation, timing alignment, in-house QA
  Output format: TextGrid / JSON + timing alignment
  Use-case anchors: ASR training, Closed-captioning, Forensic linguistics

Modality: Parallel corpus
  Volume capability: 38+ language pairs, regulated-vertical
  Specialisation: Post-edited machine translation, domain-tagged
  Output format: TMX / XLIFF + dual-source provenance
  Use-case anchors: MT training, Multilingual model alignment

Consent Chain

Per-contributor consent, audit-defensible by design

Marketplace consent is bundled into platform Terms of Service: it is aggregate, and it is not valid under GDPR Article 7 for biometric voice or facial data. YPAI consent is recorded per contributor and per project, with Article 9 special-category handling kept separate and a 30-day erasure SLA.
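As a rough illustration of what "per contributor, per project, with Article 9 separated" can mean in practice, the sketch below models one consent record and a check on whether it covers a given use. The class name, field names, and the `covers` helper are hypothetical and chosen for this example only; they are not YPAI's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical per-contributor, per-project consent record."""
    contributor_id: str
    project_id: str
    granted_at: datetime
    purposes: tuple[str, ...]
    # Article 9 (special-category) consent is tracked as its own flag,
    # so it is never implied by the general Article 7 consent.
    special_category_consent: bool = False
    withdrawn_at: Optional[datetime] = None


def covers(record: ConsentRecord, project_id: str, purpose: str,
           special_category: bool = False) -> bool:
    """Return True if the record permits this purpose on this project."""
    if record.withdrawn_at is not None:
        return False  # withdrawn consent covers nothing
    if record.project_id != project_id:
        return False  # consent is project-scoped, not platform-wide
    if purpose not in record.purposes:
        return False
    if special_category and not record.special_category_consent:
        return False  # biometric use needs the separate Article 9 consent
    return True


example = ConsentRecord(
    contributor_id="c-001",
    project_id="p-42",
    granted_at=datetime(2026, 1, 15, tzinfo=timezone.utc),
    purposes=("asr-training",),
)
```

In this sketch a record granted for one project never covers another, and a biometric use fails unless the separate special-category flag is set.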

Regulatory Alignment

EU AI Act Article 10, mapped to deliverables

Article 10 sets data-governance obligations for high-risk AI systems. YPAI ships the documentation that supports your conformity assessment. We do not certify your AI system; we provide the evidence pack your assessment needs.

Article 10(2)(b)

data governance and management practices

Per-project data governance manifest

Sub-processor list, sampling methodology, and QA audit log shipped per project.

Article 10(3)

relevant statistical properties

Demographic and dialect metadata

Per-recording demographic, dialect, and locale metadata. Aggregate distribution report per project.

Article 10(4)

examination in view of possible biases

Bias-audit log

Sampling methodology, demographic coverage report, and known-gap declarations included in the deliverable bundle.

Article 10(5)

processing of special categories of personal data

Article 9 separated evidence pack

Special-category consent records, erasure-receipt log, and purpose-limited retention attestations.
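The aggregate distribution report and known-gap declarations mentioned above can be derived from per-recording metadata with very little code. The sketch below is illustrative only: the record layout (a list of dictionaries with a `dialect` key) and both function names are assumptions, not the format of the shipped deliverables.

```python
from collections import Counter


def distribution(records: list[dict], field: str) -> dict[str, float]:
    """Share of records per value of `field`, ignoring missing values."""
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def known_gaps(records: list[dict], field: str,
               required_values: list[str]) -> list[str]:
    """Required values of `field` that no record covers, sorted."""
    present = {r.get(field) for r in records}
    return sorted(v for v in required_values if v not in present)


# Hypothetical per-recording metadata for a small project.
sample = [
    {"recording_id": "r1", "dialect": "nb-east"},
    {"recording_id": "r2", "dialect": "nb-east"},
    {"recording_id": "r3", "dialect": "nb-west"},
    {"recording_id": "r4", "dialect": "nb-west"},
]
```

A gap list produced this way is only as good as the list of required values, which would need to be agreed at project scoping.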

Contributor Network

40,000+ verified contributors across 150+ languages

Identity-verified network with documented EEA residency, demographic metadata per contributor, and self-hosted EU annotation infrastructure. Not a marketplace crowd: a vetted network with individual-level provenance.

Verified contributors

40,000+

Identity-verified at intake

Languages and locales

150+

Documented per-language coverage

Countries reached

50+

EEA-resident contributor network

Jurisdiction

EEA

Norwegian legal entity, not subject to the US CLOUD Act

Norwegian, Swedish, Danish, Finnish, Icelandic, German, Dutch, French, Spanish, Italian, Portuguese, Polish, Czech, Slovak, Greek, Romanian, Bulgarian, Hungarian, Estonian, Latvian, Lithuanian, Slovenian, Croatian, Maltese, Irish, Welsh, Basque, Catalan, Galician, and 120+ more

MTPE Coverage · 38+ language pairs

Post-edited machine translation across regulated-vertical lexicons

Coverage is organised in three language groups plus a regulated-vertical lexicon overlay. The exact pair list and per-pair throughput are confirmed at project scoping; the groups in scope are listed below.

  • EU-24: EN paired with FR, DE, ES, IT, NL, PT, PL, CS, HU, EL, BG, RO, HR, SK, SL, ET, LT, LV, FI, SV, DA, MT, GA
  • Nordic + EFTA: NO, IS, plus EU-Nordic cross-pairs (SV-NO, DA-NO, FI-SV)
  • Asian + Cyrillic: EN paired with JA, ZH, KO, AR, RU, UK, TR
  • Vertical lexicons: Legal (case-law citation, EU Regulation), Medical (ICD-11, pharmaceutical), Financial (MiFID II, KYC, accounting)

Pair-level throughput, post-edit quality grade, and reviewer credentials confirmed at project scoping. DPA and Article 28 clauses are included by default, not on request.

Deliverable Evidence Pack

Every project ships an audit-ready evidence pack

On delivery you receive the data plus a structured evidence bundle your DPO, legal counsel, and procurement team can review without follow-up. Master DPA included by default, not on request.

bundle.tree YPAI-DC / per project
project-deliverable/
  consent-records/                  [per-contributor]
    consent-{id}.json               [signed]
    article-9-special-category/     [separated pack]
  provenance/                       [per-recording]
    recording-{id}.json             [tamper-evident]
    device-locale-dialect-metadata/
  qa-logs/                          [immutable versioning]
  demographic-metadata.csv          [aggregate + per-record]
  sampling-methodology.pdf          [bias-audit input]
  bias-audit-log.pdf                [Article 10(4)]
  sub-processor-list.pdf            [transparency]
  erasure-receipts.log              [30-day SLA]
  DPA.pdf                           [signed, project-scoped]

  • consent-records/

    One JSON per contributor with the Article 7 capture, separated Article 9 evidence if applicable, and the documented revocation pathway.

  • provenance/

    Per-recording manifest with device, locale, dialect, and tamper-evident signing on the consent and metadata bundle.

  • qa-logs/ and bias-audit-log.pdf

    Documentation supporting your EU AI Act Article 10(4) bias examination. Sampling methodology, demographic coverage, and known-gap declarations included.

  • DPA.pdf and erasure-receipts.log

    Master DPA signed and project-scoped, included with every engagement. Erasure-receipt log tracks the 30-day SLA at the per-record level.
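A receiving team might want to confirm that a delivered bundle matches the layout shown in the tree above before handing it to a DPO. The sketch below does that with a plain path check. The entry list is transcribed from the tree; the function name is hypothetical, and the Article 9 directory is left out of the required set because the pack is described as present only where applicable.

```python
from pathlib import Path

# Required entries, transcribed from the bundle tree above.
# A trailing "/" marks a directory; everything else is a file.
# consent-records/article-9-special-category/ is omitted on purpose:
# it is only present when special-category data is in scope.
EXPECTED_ENTRIES = [
    "consent-records/",
    "provenance/",
    "provenance/device-locale-dialect-metadata/",
    "qa-logs/",
    "demographic-metadata.csv",
    "sampling-methodology.pdf",
    "bias-audit-log.pdf",
    "sub-processor-list.pdf",
    "erasure-receipts.log",
    "DPA.pdf",
]


def missing_entries(bundle_root: Path) -> list[str]:
    """Return the expected entries that are absent from bundle_root."""
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = bundle_root / entry.rstrip("/")
        # Directories and files are checked with the matching predicate.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_entries(Path("project-deliverable"))` returns an empty list when every required entry is present. The check covers layout only; it does not validate signatures or file contents.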

Master DPA template and audit-artefact specifications

Next Step

Scope your multi-modal collection

Tell us the modality, the regulatory context, and the volume. We map a delivery plan with the consent chain, evidence pack, and master DPA included by default.

EU AI Act Article 10 applies from 2026-08-02. Cumulative GDPR fines have passed EUR 7.1B. The cost of getting data provenance wrong is procurement-blocking; the cost of getting it right is a conversation.

Master DPA included with every YPAI engagement, not on request. Norwegian legal entity. Reply within one business day.

DATA COLLECTION INTAKE

Scope a collection project.

Bring modality, environment, volume estimate, and any regulatory constraints. A named project lead replies within one EU business day with a feasibility read.

  • GDPR Article 7 consent records on every asset
  • EEA-only operations, Norwegian Aksjeselskap
  • Zero web scraping, identity-verified contributors only
  • 30-day erasure SLA on withdrawal

GDPR Article 7 · GDPR Article 9 · EU AI Act Article 10

What happens next

From submit to scoped pilot in seven days

This section covers three situations: you have submitted and want to know the timing, you are about to submit and have a procurement objection, or you are not ready to submit and want a route deeper into the work.

After you submit

  1. T+1 day

    Project lead reads your brief

    A named EU-resident project lead replies within one business day with feasibility, scope clarifications, and a first read on Article 10 risk classification.

  2. T+3 days

    Sample evidence pack returned

An anonymised sample evidence pack is returned (data_provenance.pdf, bias_assessment.pdf, consent_audit.csv, residency_attestation.pdf) and the scoping call agenda is agreed.

  3. T+5-7 days

    Free pilot delivered

The free pilot covers both recording and annotation: 2 languages, 5 hours of native-speaker recording per language, and 1,000 utterances per language with transcripts, wake-word labels, and intent labels. The production engagement is scoped from there by modality, volume, and regulatory context.

  4. T+14 days

    Master DPA signed, production scope locked

    Article 28 clauses pre-cleared, EEA-resident processing committed in contract. Sub-processor list named, withdrawal SLA confirmed. Production data starts flowing.

Procurement FAQ

Is there a minimum project size?

No hard floor. A typical paid engagement starts at around 10 to 50 hours of audio or 1,000 to 10,000 samples per modality. The free pilot is fixed at 2 languages, with 5 hours of recording and 1,000 utterances per language.

Can I see a sample evidence pack before signing?

Yes. An anonymised sample evidence pack (data_provenance, bias_assessment, consent_audit, residency_attestation) is sent on request during the scoping window, around T+3 days.

Does the DPA require negotiation?

No. Article 28 clauses are pre-cleared and included with every contract. Customer-specific addendums on residency or sub-processor scope are accepted, but the standard DPA ships by default, not on request.

What about US-based sub-processors?

None by default. EEA-only operations with a named sub-processor list confirmed at scoping. Any US-domiciled sub-processor requires explicit customer sign-off via DPA addendum.

What is the withdrawal and erasure SLA?

30-day erasure SLA on any speaker or contributor withdrawal under GDPR Article 7. Recordings and derived datasets are wiped within 30 days; the audit log is retained for compliance traceability.
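The 30-day SLA reduces to simple date arithmetic, which is what a per-record erasure-receipt check would rest on. The sketch below assumes the 30 days are calendar days counted from the withdrawal date, with the deadline day itself still inside the SLA; the source does not state either detail, and the function names are hypothetical.

```python
from datetime import date, timedelta

ERASURE_SLA_DAYS = 30  # from the SLA stated above


def erasure_deadline(withdrawn_on: date) -> date:
    """Last day on which erasure still meets the SLA (assumed inclusive)."""
    return withdrawn_on + timedelta(days=ERASURE_SLA_DAYS)


def within_sla(withdrawn_on: date, erased_on: date) -> bool:
    """True if the erasure happened on or before the deadline."""
    return withdrawn_on <= erased_on <= erasure_deadline(withdrawn_on)
```

Under these assumptions, a withdrawal on 2026-03-01 gives a deadline of 2026-03-31, and an erasure receipt dated 2026-04-01 would count as a breach.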

All routes preserve the same compliance posture: Norwegian Aksjeselskap, EEA-resident processing, GDPR Article 7 consent, EU AI Act Article 10 evidence at delivery.