- Version: v1.0
- Last updated: 2026-05-14
- Audience: ML, Data Eng, CISO, Procurement
Data Collection Infrastructure at YPAI
This brief documents the architecture, data flows, evidence artifacts, and regulatory mappings of YPAI's multimodal data collection infrastructure. It targets engineering, security, and procurement teams conducting vendor diligence. The document supersedes prior architectural summaries.
Executive Summary
YPAI is a Norwegian Aksjeselskap headquartered in Oslo, operating with EEA-resident infrastructure. The platform covers data collection, annotation, validation, and delivery across audio and speech, image, video, text, LiDAR and 3D point clouds, and sensor modalities. The contributor network spans 150+ languages and 50+ countries.
Four structural pillars differentiate the platform from US-headquartered vendors: (1) Norwegian corporate jurisdiction outside US CLOUD Act reach; (2) self-hosted, single-tenant annotation infrastructure on CVAT, Label Studio, and proprietary tooling; (3) per-contributor cryptographic consent records aligned with GDPR Article 7; and (4) microsecond-precision multimodal synchronization.
Architectural Overview
Data flows through five stages: collect, ingest, annotate, QA, deliver. Edge devices transmit payloads over TLS 1.3. The ingestion gateway authenticates devices with mutual TLS client certificates, terminates the transport encryption, and re-encrypts the payload at rest using AES-256-GCM with customer-scoped envelope keys.
The annotation environment ingests data from the storage layer through a one-way replication channel. QA workflows operate on annotated outputs only. The delivery service signs and ships datasets through customer-elected transport: SFTP, SCCs-backed object storage transfer, or customer-managed key escrow.
Five zero-trust principles operate at every stage: mutual TLS at every hop, no implicit trust between stages, per-stage RBAC, immutable audit logging, and default-deny network ACLs. YPAI is a Norwegian company, which places customer data outside the reach of US CLOUD Act compulsion and aligns with GDPR Article 48 by default. See provenance audit detail.
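The "mutual TLS at every hop" principle can be illustrated with Python's standard `ssl` module. This is a minimal sketch, not YPAI's gateway configuration: the function name and the file-path parameters are hypothetical, and certificate loading is skipped when no paths are given so the snippet runs without key material.

```python
import ssl

def make_ingest_server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Build a server-side TLS context that only accepts TLS 1.3 clients
    presenting a certificate (mutual TLS). All parameters are hypothetical."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse anything older than TLS 1.3
    ctx.verify_mode = ssl.CERT_REQUIRED           # the client must present a certificate
    if client_ca_file:
        # CA that issued the device certificates
        ctx.load_verify_locations(cafile=client_ca_file)
    if cert_file and key_file:
        # The gateway's own certificate and private key
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx

ctx = make_ingest_server_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
print(ctx.minimum_version == ssl.TLSVersion.TLSv1_3)
```

In a real deployment the context would be given the gateway certificate and the device CA, then used to wrap the listening socket.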
Data Collection and Provenance
Modality coverage spans audio and speech, video, image, LiDAR and 3D point cloud, sensor (radar, IoT, wearables, environmental), and text and NLP. Collection quality is not a post-processing step. For each modality, YPAI specifies the capture parameters, hardware, and signal targets before a session begins, and verifies them against named industry standards: ISO/IEC 5259-4 for data-quality frameworks, ITU-T P.56 for active speech level, EBU R128 and ITU-R BS.1770 for loudness, and SMPTE ST 2110 for multi-sensor synchronization.
3.1 Voice and speech
Voice data quality is bounded by capture physics, not by post-processing. YPAI specifies the acoustic environment, the capture hardware, and the signal targets before a session begins. Active speech level is computed per ITU-T P.56; loudness is normalized per EBU R128 and ITU-R BS.1770.
| Parameter | Specification |
|---|---|
| Sample rate | 16 kHz for ASR baselines; 44.1 / 48 kHz for TTS, voice cloning, and high-fidelity work |
| Bit depth | 16-bit (96 dB dynamic range) baseline; 24-bit (144 dB) for studio, TTS, and automotive captures |
| Container and codec | PCM WAV or FLAC, lossless only. Lossy formats (MP3, AAC) are excluded from training corpora |
| Channels | Mono for TTS; multi-track or microphone-array capture for diarization, spatial, and far-field work |
| Capture hardware | Large-diaphragm condenser microphones (sE2200 class, self-noise below 12 dB(A)) into a 24-bit interface (Focusrite Scarlett 2i2 class) |
| Language coverage | 150+ languages |
| Contributor network | 40,000+ vetted contributors across 50+ countries |
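The dynamic-range figures in the table follow from the standard relation for linear PCM, roughly 6.02 dB per bit. The sketch below reproduces that arithmetic and adds an uncompressed data-rate estimate; the function names are illustrative only.

```python
import math

def pcm_dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bit_depth * channels // 8

for bits in (16, 24):
    print(bits, "bit:", round(pcm_dynamic_range_db(bits), 1), "dB")

# 48 kHz, 24-bit, mono capture
print(pcm_bytes_per_second(48_000, 24), "bytes/s")
```

The table's 96 dB and 144 dB values are these results rounded down to the nearest decibel.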
YPAI classifies every voice engagement into one of four acoustic environment classes, each with a defined signal target. The target is set at scoping and verified at quality control.
| Environment class | SNR target | Noise floor | Reverberation |
|---|---|---|---|
| Studio | above 40 dB | below -60 dBFS | RT60 below 0.2 s |
| Quiet indoor / office | 30 dB or above | below -45 dBFS | controlled |
| In-cabin (automotive) | 10 to 25 dB | environment-dependent | environment-dependent |
| Street / far-field | 5 to 15 dB | environment-dependent | environment-dependent |
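A quality-control check against the studio and quiet-indoor targets could look like the sketch below. It is an illustration under stated assumptions: levels are plain RMS over float samples in the range -1.0 to 1.0, which is simpler than the ITU-T P.56 active speech level the brief specifies, and the function names and synthetic test signals are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a non-empty list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_dbfs(samples):
    """RMS level in dB relative to full scale (1.0). Undefined for pure silence."""
    return 20 * math.log10(rms(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from a speech segment and a noise-only segment."""
    return 20 * math.log10(rms(speech) / rms(noise))

def meets_target(env, snr, noise_floor_dbfs):
    """Compare measurements with the SNR and noise-floor targets in the table."""
    if env == "studio":
        return snr > 40.0 and noise_floor_dbfs < -60.0
    if env == "quiet_indoor":
        return snr >= 30.0 and noise_floor_dbfs < -45.0
    raise ValueError("targets for this class are environment-dependent")

# Synthetic stand-ins: a 0.5-amplitude tone as "speech", a 0.0005-amplitude tone as "noise"
speech = [0.5 * math.sin(2 * math.pi * k / 100) for k in range(1000)]
noise = [0.0005 * math.sin(2 * math.pi * k / 100) for k in range(1000)]

snr = snr_db(speech, noise)
floor = level_dbfs(noise)
print(round(snr, 1), round(floor, 1), meets_target("studio", snr, floor))
```

The in-cabin and street classes are left out because their noise floor and reverberation are environment-dependent in the table.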
Voice engagements specify stratified age bands, gender balance, and regional accent spread. Read-speech protocols enforce phonetic balance; spontaneous-speech protocols capture natural disfluency; code-switching prompts support multilingual models.
3.2 Video
Video data quality is set by sensor capability, frame cadence, and synchronization. Multi-camera capture runs on SMPTE ST 2110 transport with PTP (Precision Time Protocol) for sub-frame alignment across the sensor array.
| Parameter | Specification |
|---|---|
| Resolution | 1080p baseline; 4K UHD for dense-scene and autonomous-vehicle work; 8K for wide-crop and remote sensing |
| Frame rate | 24 / 30 fps for standard capture; 60 / 120 fps for high-speed motion and AV perception |
| Codec | H.264 / H.265 for delivery; ProRes / DNxHR for edit-grade intra-frame; RAW where the pipeline requires it |
| Bit depth | 8-bit standard; 10-bit for HDR captures |
| Chroma subsampling | 4:2:0 for general capture; 4:2:2 / 4:4:4 for segmentation and edge-precise annotation |
| Multi-camera sync | SMPTE ST 2110 transport with PTP (Precision Time Protocol) sub-frame alignment |
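Sub-frame alignment can be expressed as a bound on the spread of capture timestamps for the same frame index across cameras. The sketch below assumes microsecond timestamps from PTP-disciplined clocks; the half-frame tolerance and the function names are hypothetical and are not YPAI acceptance criteria.

```python
def frame_period_us(fps: float) -> float:
    """Duration of one frame in microseconds."""
    return 1_000_000 / fps

def max_skew_us(timestamps_us):
    """Largest difference between capture timestamps of the same frame index."""
    return max(timestamps_us) - min(timestamps_us)

def is_sub_frame_aligned(timestamps_us, fps, fraction=0.5):
    """True when the skew across cameras is below a fraction of one frame period.
    The default fraction of 0.5 is an assumed tolerance."""
    return max_skew_us(timestamps_us) < fraction * frame_period_us(fps)

# Three cameras reporting the same frame at 60 fps
frame_stamps = [1_000_000, 1_000_120, 999_950]
print(max_skew_us(frame_stamps), is_sub_frame_aligned(frame_stamps, fps=60))
```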
Video capture spans a defined environmental matrix: lighting conditions (daylight, overcast, golden hour, twilight, low-light), weather (rain, snow, fog), and controlled motion and occlusion scenarios.
3.3 Image
Image data quality is bounded by sensor physics and is increasingly threatened by synthetic-data contamination. YPAI specifies optical floors and rejects generative output at intake, so the corpus reflects physical capture rather than diffusion-model artifacts.
| Parameter | Specification |
|---|---|
| Resolution floor | 12 MP minimum for general object detection; 24 MP and above for high-density work such as aerial survey and medical imaging |
| Format | RAW (12 / 14 / 16-bit full sensor data); PNG for lossless delivery; JPEG only at quality 100 |
| Sensor consideration | Pixel pitch governs low-light SNR; full-frame sensors are specified where shadow detail matters |
| Diversity sampling | Combinatorial matrix of subject, angle, distance, and lighting |
| Synthetic-data rejection | Cryptographic and visual checks confirm physical-camera capture; diffusion-model output is rejected |
| Annotation readiness | High acutance without over-sharpening halos; lens distortion (barrel, pincushion) corrected before annotation; CVAT-ready for polygon, keypoint, and 3D bounding-box work |
3.4 Quality-control gates
Collection quality is verified, not assumed. Every modality runs the same two-layer quality-control pipeline before a delivery is signed.
As a stated YPAI practice, a GDPR metadata scrub runs at intake: PII and geolocation metadata are stripped from each payload.
3.5 Provenance and lineage
Each payload receives a microsecond-precision timestamp at capture, a contributor identifier hashed with SHA-256, a device identifier, a session identifier, and a consent record reference. Lineage events are appended to an immutable audit log. The lineage manifest is emitted at delivery time as a JSON artifact.
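One common way to make an append-only audit log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is an assumption about how such a log could work, not a description of YPAI's implementation; the brief does not say whether the lineage root hash is a chain head or a Merkle root. The event fields, the `salt` parameter, and all function names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64

def utc_now_us() -> str:
    """UTC timestamp with microsecond precision, ISO 8601 with a Z suffix."""
    return datetime.now(timezone.utc).isoformat(timespec="microseconds").replace("+00:00", "Z")

def hash_contributor_id(contributor_id: str, salt: str = "") -> str:
    """SHA-256 of the contributor identifier; the salt is an assumed hardening step."""
    return "sha256:" + hashlib.sha256((salt + contributor_id).encode("utf-8")).hexdigest()

def _entry_hash(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_event(log: list, event: dict) -> str:
    """Append an event that commits to the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    digest = _entry_hash(prev, event)
    log.append({"prev_hash": prev, "event": event, "entry_hash": digest})
    return digest

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or _entry_hash(prev, entry["event"]) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True

log = []
append_event(log, {"type": "capture", "at": utc_now_us(),
                   "contributor": hash_contributor_id("contributor-001", salt="demo")})
head = append_event(log, {"type": "ingest", "at": utc_now_us()})
print(verify_chain(log), head == log[-1]["entry_hash"])
```

Publishing the final entry hash alongside a delivery lets a recipient detect later edits to any earlier entry.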
{
  "_comment": "[SAMPLE] structure only; production manifests are issued per delivery",
  "manifest_id": "ypai-mf-2026-Q2-c8d7",
  "delivery_id": "ypai-2026-Q2-AB12",
  "modality": "audio_speech",
  "captured_window": {
    "start": "2026-04-01T08:14:22.184739Z",
    "end": "2026-04-12T17:48:09.502611Z"
  },
  "contributors": {
    "count": 142,
    "identifier_hash_alg": "sha256",
    "country_distribution": { "NO": 38, "SE": 22, "DK": 19, "FI": 17, "DE": 26, "FR": 20 }
  },
  "device_fleet": {
    "audio_interface": "Focusrite Scarlett 2i2",
    "microphone": "sE2200"
  },
  "consent_records": {
    "scheme": "GDPR Art. 7",
    "record_count": 142,
    "withdrawal_endpoint": "https://contributor-portal.ypai.ai/api/v1/withdraw"
  },
  "lineage_log_root_hash": "sha256:7f2a...e93c",
  "manifest_signature_alg": "ed25519",
  "manifest_signature": "[SAMPLE]"
}
Identity, Access, and Annotation Infrastructure
YPAI operates three annotation environments. CVAT (self-hosted) handles bounding box, semantic and instance segmentation, keypoint, and polygon workflows on image and video. Label Studio (self-hosted) handles timeline and frame-by-frame video, audio waveform tasks, and text annotation. Proprietary tooling handles workflows that exceed what the open-source tools support: multi-track speaker diarization with paired-dialogue context, LiDAR plus camera fusion review, and per-contributor consent workflow integration.
Each customer engagement provisions a dedicated annotation VPC. The VPC has no shared database boundary with any other engagement. Reviewer workstations connect through a hardened jump host. Data replication from the data plane to the annotation VPC is one-way and authenticated. No annotation workload egresses to the public internet.
Reviewers authenticate against the corporate identity provider via SAML 2.0 or OIDC. MFA is enforced via WebAuthn or TOTP. Workstation sessions inherit short-lived tokens, time-bound to reviewer shifts. RBAC binds five role classes (viewer, annotator, senior annotator, qa_reviewer, engagement_lead) to a permission matrix scoped to the engagement VPC. Cross-engagement access is structurally impossible because IAM policy is scoped at the VPC boundary.
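The role-to-permission binding described above can be sketched as a default-deny check on a short-lived, engagement-scoped token. Only the five role names come from this brief (with `senior_annotator` written with an underscore for consistency); the permission names, the `Token` fields, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical permission matrix; only the role names come from the brief.
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "annotator": {"view", "annotate"},
    "senior_annotator": {"view", "annotate", "resolve_disagreement"},
    "qa_reviewer": {"view", "review"},
    "engagement_lead": {"view", "annotate", "resolve_disagreement", "review", "manage"},
}

@dataclass(frozen=True)
class Token:
    subject: str
    role: str
    engagement_id: str
    expires_at: float  # epoch seconds, bounded by the reviewer's shift

def is_allowed(token: Token, action: str, engagement_id: str, now: float) -> bool:
    """Default deny: expired tokens, other engagements, and unknown roles all fail."""
    if now >= token.expires_at:
        return False
    if token.engagement_id != engagement_id:
        return False
    return action in ROLE_PERMISSIONS.get(token.role, set())

tok = Token("reviewer-17", "annotator", "eng-A", expires_at=200.0)
print(is_allowed(tok, "annotate", "eng-A", now=100.0))  # same engagement, valid token
print(is_allowed(tok, "annotate", "eng-B", now=100.0))  # different engagement
print(is_allowed(tok, "annotate", "eng-A", now=250.0))  # expired token
```

This application-level check only illustrates the logic; the brief places the actual cross-engagement boundary in IAM policy scoped to the VPC.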
Quality Assurance and Evaluation
Production annotation runs a minimum of two independent passes per task. The system compares outputs using a task-appropriate consensus algorithm. Bounding-box and segmentation tasks compare via Intersection over Union (IoU). Sequence-labeling tasks compare via token-level F1. Audio diarization tasks compare via Diarization Error Rate (DER) on the consensus segments. Disagreements above a configured threshold escalate to a senior reviewer.
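Two of the agreement measures named above can be sketched in a few lines. The box format `(x1, y1, x2, y2)`, the treatment of pass A as the reference for F1, the `"O"` outside tag, and the 0.5 threshold are assumptions for illustration; DER is not covered here.

```python
def box_iou(a, b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def token_f1(reference, candidate, outside="O"):
    """Token-level F1 between two equal-length tag sequences, ignoring the outside tag."""
    tp = sum(1 for r, c in zip(reference, candidate) if r == c and r != outside)
    fp = sum(1 for r, c in zip(reference, candidate) if c != outside and c != r)
    fn = sum(1 for r, c in zip(reference, candidate) if r != outside and c != r)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def needs_escalation(agreement: float, threshold: float) -> bool:
    """Escalate to a senior reviewer when agreement falls below the threshold."""
    return agreement < threshold

iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
f1 = token_f1(["B-PER", "I-PER", "O", "B-LOC"], ["B-PER", "O", "O", "B-LOC"])
print(round(iou, 3), round(f1, 3), needs_escalation(iou, threshold=0.5))
```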
Each engagement starts with a gold-set definition: a small set of representative tasks with reviewer-validated correct outputs. Reviewers must pass calibration on the gold set before entering the production queue. The gold set is versioned. Production reviewers periodically re-run gold tasks blind; failed re-runs gate the reviewer out of the queue pending re-calibration. Inter-annotator agreement is tracked over rolling windows. The IAA history ships in the per-engagement evidence package.
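The brief does not name the agreement statistic or the gating rule, so the sketch below makes two assumptions: inter-annotator agreement is measured with Cohen's kappa for two annotators, and a reviewer stays in the queue while the pass rate over a rolling window of blind gold re-runs meets a minimum. The window size, the threshold, and the class name are hypothetical.

```python
from collections import Counter, deque

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

class GoldGate:
    """Tracks blind gold-set re-runs over a rolling window (assumed policy)."""

    def __init__(self, window=20, min_pass_rate=0.9):
        self.results = deque(maxlen=window)
        self.min_pass_rate = min_pass_rate

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def in_queue(self) -> bool:
        if not self.results:  # no calibration yet, so not admitted
            return False
        return sum(self.results) / len(self.results) >= self.min_pass_rate

kappa = cohen_kappa(["cat", "cat", "dog", "dog"], ["cat", "dog", "dog", "dog"])

gate = GoldGate(window=5, min_pass_rate=0.8)
for outcome in (True, True, True, True, False):
    gate.record(outcome)
still_admitted = gate.in_queue()  # 4 of 5 passed
gate.record(False)                # window now holds 3 passes out of 5
print(round(kappa, 2), still_admitted, gate.in_queue())
```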
Data Delivery and Retention
Three transport options are available by default: a customer-elected SFTP endpoint with customer-managed credentials; customer-managed, S3-compatible object storage with customer-provided KMS keys; and physical secure media for very large or air-gap-required deliveries. Customer-managed keys are supported in all software paths. Private network options (AWS PrivateLink, Azure ExpressRoute) are configurable at engagement scoping.
Output formats include JSON-LD with manifest.sha256 (default), COCO, Pascal VOC, TFRecord, Parquet, WebDataset, KITTI, and customer-negotiated custom formats.
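A recipient can check delivery integrity by recomputing file digests against `manifest.sha256`. The sketch below assumes that file uses the common `sha256sum` layout (hex digest, whitespace, relative file name per line), which this brief does not confirm; the function names and the demo file are hypothetical. Verifying the ed25519 manifest signature would need a third-party cryptography library and is not shown.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large deliveries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksums(delivery_dir, checksum_name="manifest.sha256"):
    """Return the names of files whose digest differs from the checksum file.
    A listed file that is missing raises FileNotFoundError."""
    root = Path(delivery_dir)
    mismatches = []
    for line in (root / checksum_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        if sha256_file(root / name) != expected.lower():
            mismatches.append(name)
    return mismatches

# Demo against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "annotations.jsonld"
    data.write_bytes(b'{"sample": true}')
    (Path(tmp) / "manifest.sha256").write_text(f"{sha256_file(data)}  annotations.jsonld\n")
    clean = verify_checksums(tmp)
    data.write_bytes(b'{"sample": false}')  # simulate corruption in transit
    tampered = verify_checksums(tmp)

print(clean, tampered)
```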
# [SAMPLE] Fetch a YPAI delivery manifest and print the fields used for signature and integrity checks
curl -sS \
  -H "Authorization: Bearer ${ENGAGEMENT_TOKEN}" \
  -H "Accept: application/json" \
  "https://delivery.ypai.ai/v1/manifests/${DELIVERY_ID}" \
  | jq '.manifest_signature, .lineage_log_root_hash'

{
  "_comment": "[SAMPLE]",
  "delivery_id": "ypai-2026-Q2-AB12",
  "modality": "audio_speech",
  "jurisdiction": "EEA / Norway",
  "consent": { "record_set_id": "ypai-cs-2026-Q2-AB12", "scheme": "GDPR Art. 7", "record_count": 142 },
  "lineage": { "contributors": 142, "devices": 12, "log_root_hash": "sha256:7f2a...e93c" },
  "qa": { "ai_act_article": "10", "sample_rate": 0.05, "reviewers": 9, "passes_per_task": 2 },
  "dpa": { "status": "executed", "version": "ypai-dpa-2026-Q1" },
  "delivery_format": "JSON-LD + manifest.sha256",
  "transport": "SFTP",
  "encryption_at_rest": "AES-256-GCM",
  "manifest_signature_alg": "ed25519",
  "manifest_signature": "[SAMPLE]"
}

Default retention windows: source captures are retained for the engagement contractual window; annotated outputs and QA evidence are retained for the contractual audit window. After contractual expiry, automated destruction executes against the data-plane storage and replicated annotation copies; destruction is logged, and the log is delivered to the customer on request. See DPA terms.
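The retention rule reduces to comparing each artifact class with the end of its contractual window. The sketch below uses made-up dates, since actual window lengths are set per contract and DPA; the dictionary keys, target names, and function names are hypothetical.

```python
from datetime import date

# Hypothetical window end dates; real values come from the contract and DPA.
RETENTION_END = {
    "source_captures": date(2026, 12, 31),    # end of the engagement contractual window
    "annotated_outputs": date(2027, 12, 31),  # end of the contractual audit window
    "qa_evidence": date(2027, 12, 31),        # end of the contractual audit window
}

def due_for_destruction(retention_end, today):
    """Artifact classes whose retention window has already closed."""
    return sorted(name for name, end in retention_end.items() if today > end)

def destruction_log_entry(artifact_class, destroyed_on):
    """Record of one destruction run, covering primary storage and replicas."""
    return {
        "artifact_class": artifact_class,
        "destroyed_on": destroyed_on.isoformat(),
        "targets": ["data_plane_storage", "annotation_replicas"],
    }

today = date(2027, 1, 15)
due = due_for_destruction(RETENTION_END, today)
entries = [destruction_log_entry(name, today) for name in due]
print(due, entries[0]["destroyed_on"])
```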
Regulatory Compliance and Auditability
The evidence package is the operative artifact. A customer conducting its own conformity assessment under EU AI Act Article 10 receives every artifact the assessment requires, framework by framework, with each control mapped below.
7.1 GDPR
YPAI operates under an EEA corporate structure (Norway) with EEA-resident data infrastructure as a baseline. Specific GDPR provisions map to specific operational controls.
| Regulation | YPAI control | Evidence artifact |
|---|---|---|
| Article 6(1)(a) Consent as lawful basis | Per-contributor signed consent before any capture | Consent record JSON-LD |
| Article 7 Conditions for consent | Specific, informed, unambiguous, withdrawable consent; consent_language_version captured | Consent record + language version registry |
| Article 9 Special category data | Explicit opt-in scope (biometric voice, medical imaging) | Consent record `scope` field |
| Article 17 Right to erasure | 30-day erasure workflow cascading to replicas | Erasure log entry per request |
| Article 28 Processor obligations | YPAI-issued DPA executed before any data flow | Signed DPA artifact |
| Article 48 Transfers or disclosures not authorised by Union law | EEA-resident storage as default; SCCs (Article 46) available on customer-elected transfer | SCCs annex; data residency manifest field |
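Two of the controls in the table reduce to simple checks: whether an intended use falls inside the consent record's `scope` field, and when a 30-day erasure workflow falls due. The sketch below is illustrative only; the function names are hypothetical, and the record contains just the `scope` field the table names.

```python
from datetime import datetime, timedelta, timezone

def use_is_permitted(consent_record: dict, intended_use: str) -> bool:
    """A use is permitted only if it appears in the record's scope list."""
    return intended_use in consent_record.get("scope", [])

def erasure_due(requested_at: datetime, workflow_days: int = 30) -> datetime:
    """Latest completion time for an erasure request under a 30-day workflow."""
    return requested_at + timedelta(days=workflow_days)

record = {"scope": ["asr_training", "tts_training", "evaluation"]}
requested = datetime(2026, 4, 12, 9, 34, 18, tzinfo=timezone.utc)

print(use_is_permitted(record, "asr_training"))
print(use_is_permitted(record, "voice_cloning"))
print(erasure_due(requested).date().isoformat())
```

A record with no `scope` field permits nothing, which keeps the check default-deny.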
7.1.1 GDPR Article 7 consent record sample
{
  "_comment": "[SAMPLE]",
  "consent_record_id": "ypai-cr-2026-04-87234",
  "contributor_id_hash": "sha256:a3f9...c4d2",
  "modality": "audio_speech",
  "signed_at": "2026-04-12T09:34:18Z",
  "legal_basis": "GDPR Article 7",
  "consent_language_version": "en-1.4",
  "scope": ["asr_training", "tts_training", "evaluation"],
  "withdrawal_endpoint": "https://contributor-portal.ypai.ai/api/v1/withdraw",
  "ip_address_hash": "sha256:b8e1...f7a9",
  "agreement_hash": "sha256:c9d2...e5b8",
  "ttl_days": null
}
7.2 EU AI Act Article 10
Article 10 governs training, validation, and testing data for high-risk AI systems. The YPAI delivery package supports the customer conformity assessment under each sub-clause. Effective date: 2 August 2026.
| Regulation | YPAI control | Evidence artifact |
|---|---|---|
| Article 10(1) Quality criteria | Multi-pass annotation with IoU, DER, F1 thresholds per task type | QA result JSON; threshold table per engagement |
| Article 10(2) Data governance practices | Provenance log per sample; annotation guideline versioning; bias detection workflow | Lineage manifest; guideline registry version |
| Article 10(3) Representativeness | Contributor demographic distribution recorded per delivery | Manifest `contributors.country_distribution` block |
| Article 10(3) Error validation | Gold-set governance and inter-annotator agreement tracking | IAA report; gold-set version |
| Article 10(5) Special-category processing | Explicit opt-in scope field on contributor consent record | Consent record `scope` field |
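The `contributors.country_distribution` block supports a basic representativeness check. The sketch below uses the counts from the sample manifest in this brief; the minimum-share threshold is a hypothetical parameter, and country of residence is only one of the dimensions a representativeness assessment would normally consider.

```python
# Counts taken from the sample manifest's contributors.country_distribution block
country_distribution = {"NO": 38, "SE": 22, "DK": 19, "FI": 17, "DE": 26, "FR": 20}
declared_count = 142  # contributors.count in the same manifest

def shares(distribution):
    """Fraction of contributors per country."""
    total = sum(distribution.values())
    return {country: n / total for country, n in distribution.items()}

def underrepresented(distribution, min_share):
    """Countries whose share falls below an assumed minimum."""
    return sorted(c for c, s in shares(distribution).items() if s < min_share)

counts_consistent = sum(country_distribution.values()) == declared_count
flagged = underrepresented(country_distribution, min_share=0.125)

for country, share in shares(country_distribution).items():
    print(country, f"{share:.1%}")
print(counts_consistent, flagged)
```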
7.3 DORA
For financial-sector clients, DORA mandates ICT risk management and third-party oversight. The single-tenant VPC architecture addresses operational-resilience testing without multi-tenant cascading-failure exposure. Specific RTO and RPO targets are documented per engagement.
| Regulation | YPAI control | Evidence artifact |
|---|---|---|
| ICT risk management (Art. 5 to 15) | Documented architecture, single-tenant isolation per engagement | Architecture document; access control matrix |
| Third-party ICT risk (Art. 28) | DPA, processor scope, sub-processor disclosure | DPA + sub-processor list |
| Incident reporting (Art. 17 to 23) | Documented incident response, log retention | Incident response runbook |
| Operational resilience testing (Art. 24 to 27) | Single-tenant blast radius; isolated recovery procedures | Recovery procedure document |
7.4 TISAX AL3
TISAX Assessment Level 3 covers data with a very high need for protection (automotive prototype data, intellectual property). YPAI controls align with the AL3 baseline. YPAI is not currently TISAX-labeled; alignment is the operational posture.
| Regulation | YPAI control | Evidence artifact |
|---|---|---|
| Physical security | Documented facility access controls; reviewer workstation hardening | Facility statement; workstation policy |
| Logical access | RBAC, MFA via WebAuthn or TOTP, jump host for VPC entry | IAM matrix; auth log |
| Encryption (transit and at rest) | TLS 1.3 + AES-256-GCM | Architecture statement |
| Air-gap option | Provisioned per engagement on request; no public-internet path | Engagement scope document |
| Prototype data segregation | Single-tenant VPC per engagement | Network topology diagram |
| TISAX label status | YPAI is NOT currently TISAX-labeled. Controls align with AL3 baseline. | Self-attestation only; not third-party assessed |
7.5 US CLOUD Act counter-position
The US CLOUD Act can compel providers subject to US jurisdiction, including US-headquartered vendors, to disclose data to US law enforcement regardless of where the data is physically hosted. Such compelled disclosure can conflict with GDPR Article 48. YPAI's Norwegian corporate structure and EEA infrastructure operate outside US jurisdiction. The brief states this as architectural fact, not a comparison.
Appendix
8.1 Glossary
- AES-256-GCM: encryption at rest
- CMK: Customer-Managed Key
- CVAT: Computer Vision Annotation Tool
- DER: Diarization Error Rate
- DORA: Digital Operational Resilience Act
- DPA: Data Processing Agreement
- ed25519: signature scheme
- GDPR: General Data Protection Regulation
- IAA: Inter-Annotator Agreement
- IAM: Identity and Access Management
- IoU: Intersection over Union
- JSON-LD: JSON Linked Data
- MTPE: Machine Translation Post-Editing
- OIDC: OpenID Connect
- RBAC: Role-Based Access Control
- SAML 2.0: Security Assertion Markup Language
- SCCs: Standard Contractual Clauses
- SFTP: SSH File Transfer Protocol
- TISAX: Trusted Information Security Assessment Exchange
- TLS 1.3: Transport Layer Security
- VPC: Virtual Private Cloud
- WebAuthn: Web Authentication
8.2 Version history
| Version | Date | Changes |
|---|---|---|
| v1.0 | 2026-05-14 | Initial publication. |
8.3 External references
- EU AI Act Article 10: artificialintelligenceact.eu/article/10
- GDPR Article 7: gdpr-info.eu/art-7-gdpr
- GDPR Article 17: gdpr-info.eu/art-17-gdpr
- GDPR Article 48: gdpr-info.eu/art-48-gdpr