Version: v1.0
Audience: ML, Data Eng, CISO, Procurement

Data Collection Infrastructure at YPAI

This brief documents the architecture, data flows, evidence artifacts, and regulatory mappings of YPAI's multimodal data collection infrastructure. It is written for engineering, security, and procurement teams conducting vendor diligence. The document supersedes prior architectural summaries.

Section 1

Executive Summary

YPAI is a Norwegian Aksjeselskap headquartered in Oslo, operating with EEA-resident infrastructure. The platform covers data collection, annotation, validation, and delivery across audio and speech, image, video, text, LiDAR and 3D point clouds, and sensor modalities. The contributor network spans 150+ languages and 50+ countries.

Four structural pillars differentiate the platform from US-headquartered vendors: Norwegian corporate jurisdiction outside US CLOUD Act reach; self-hosted, single-tenant annotation infrastructure on CVAT, Label Studio, and proprietary tooling; per-contributor cryptographic consent records aligned with GDPR Article 7; and microsecond-precision multimodal synchronization.

Section 2

Architectural Overview

Data flows through five stages: collect, ingest, annotate, QA, and deliver. Edge devices transmit payloads over TLS 1.3. The ingestion gateway authenticates devices with mutual TLS certificates, terminates the transport encryption, and re-encrypts each payload at rest using AES-256-GCM with customer-scoped envelope keys.

The annotation environment ingests data from the storage layer through a one-way replication channel. QA workflows operate on annotated outputs only. The delivery service signs and ships datasets through customer-elected transport: SFTP, SCCs-backed object storage transfer, or customer-managed key escrow.

Five zero-trust principles apply at every stage: mutual TLS at every hop, no implicit trust between stages, per-stage RBAC, immutable audit logging, and default-deny network ACLs. YPAI is a Norwegian company, which places customer data outside the reach of US CLOUD Act compulsion and aligns with GDPR Article 48 by default. See provenance audit detail.

Section 3

Data Collection and Provenance

Modality coverage spans audio and speech, video, image, LiDAR and 3D point cloud, sensor (radar, IoT, wearables, environmental), and text and NLP. Collection quality is not a post-processing step. For each modality, YPAI specifies the capture parameters, hardware, and signal targets before a session begins, and verifies them against named industry standards: ISO/IEC 5259-4 for data-quality frameworks, ITU-T P.56 for active speech level, EBU R128 and ITU-R BS.1770 for loudness, and SMPTE ST 2110 for multi-sensor synchronization.

3.1 Voice and speech

Voice data quality is bounded by capture physics, not by post-processing. YPAI specifies the acoustic environment, the capture hardware, and the signal targets before a session begins. Active speech level is computed per ITU-T P.56; loudness is normalized per EBU R128 and ITU-R BS.1770.

Voice and speech capture specification
| Parameter | Specification |
| --- | --- |
| Sample rate | 16 kHz for ASR baselines; 44.1 / 48 kHz for TTS, voice cloning, and high-fidelity work |
| Bit depth | 16-bit (96 dB dynamic range) baseline; 24-bit (144 dB) for studio, TTS, and automotive captures |
| Container and codec | PCM WAV or FLAC, lossless only. Lossy formats (MP3, AAC) are excluded from training corpora |
| Channels | Mono for TTS; multi-track or microphone-array capture for diarization, spatial, and far-field work |
| Capture hardware | Large-diaphragm condenser microphones (sE2200 class, self-noise below 12 dB(A)) into a 24-bit interface (Focusrite Scarlett 2i2 class) |
| Language coverage | 150+ languages |
| Contributor network | 40,000+ vetted contributors across 50+ countries |
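
A minimal sketch of how the sample-rate and bit-depth floors in the table could be checked on a delivered PCM WAV file. It uses only Python's standard `wave` module, so FLAC is not covered. The function names and default thresholds are illustrative assumptions, not part of any YPAI tooling.

```python
import wave

# Floors taken from the ASR baseline row of the table; adjust per engagement.
MIN_SAMPLE_RATE_HZ = 16_000
MIN_BIT_DEPTH = 16


def theoretical_dynamic_range_db(bit_depth: int) -> float:
    """Approximate dynamic range of linear PCM: about 6.02 dB per bit."""
    return 6.02 * bit_depth


def check_wav(path: str, min_rate: int = MIN_SAMPLE_RATE_HZ,
              min_bits: int = MIN_BIT_DEPTH) -> list[str]:
    """Return a list of spec violations for a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        if rate < min_rate:
            problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
        if bits < min_bits:
            problems.append(f"bit depth {bits} is below {min_bits}")
    return problems
```

The 6.02 dB-per-bit rule is where the 96 dB and 144 dB figures in the table come from: 16 and 24 bits give roughly 96 dB and 144 dB.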

YPAI classifies every voice engagement into one of four acoustic environment classes, each with a defined signal target. The target is set at scoping and verified at quality control.

Acoustic environment classes
| Environment class | SNR target | Noise floor | Reverberation |
| --- | --- | --- | --- |
| Studio | above 40 dB | below -60 dBFS | RT60 below 0.2 s |
| Quiet indoor / office | 30 dB or above | below -45 dBFS | controlled |
| In-cabin (automotive) | 10 to 25 dB | environment-dependent | environment-dependent |
| Street / far-field | 5 to 15 dB | environment-dependent | environment-dependent |
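
As an illustration of verifying a capture against its scoped class, the sketch below computes SNR from RMS amplitudes and compares it with the targets in the table. The class keys, the function names, and the choice to treat the in-cabin and street ranges as inclusive bands are assumptions made for this example.

```python
import math

# SNR targets in dB per environment class, copied from the table.
# The upper bound is None where the table gives only a floor.
SNR_TARGETS_DB = {
    "studio": (40.0, None),            # "above 40 dB" (exclusive floor)
    "quiet_indoor": (30.0, None),      # "30 dB or above" (inclusive floor)
    "in_cabin": (10.0, 25.0),
    "street_far_field": (5.0, 15.0),
}


def snr_db(speech_rms: float, noise_rms: float) -> float:
    """SNR in dB from RMS amplitudes: 20 * log10(speech / noise)."""
    if speech_rms <= 0 or noise_rms <= 0:
        raise ValueError("RMS values must be positive")
    return 20.0 * math.log10(speech_rms / noise_rms)


def meets_target(env_class: str, measured_snr_db: float) -> bool:
    """Check a measured SNR against the target for the scoped class."""
    low, high = SNR_TARGETS_DB[env_class]
    if env_class == "studio":
        above_floor = measured_snr_db > low
    else:
        above_floor = measured_snr_db >= low
    return above_floor and (high is None or measured_snr_db <= high)
```

Because the in-cabin and street ranges overlap, the check takes the class agreed at scoping as input rather than trying to infer the class from the measurement.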

Voice engagements specify stratified age bands, gender balance, and regional accent spread. Read-speech protocols enforce phonetic balance; spontaneous-speech protocols capture natural disfluency; code-switching prompts support multilingual models.

3.2 Video

Video data quality is set by sensor capability, frame cadence, and synchronization. Multi-camera capture runs on SMPTE ST 2110 transport with PTP (Precision Time Protocol) for sub-frame alignment across the sensor array.

Video capture specification
| Parameter | Specification |
| --- | --- |
| Resolution | 1080p baseline; 4K UHD for dense-scene and autonomous-vehicle work; 8K for wide-crop and remote sensing |
| Frame rate | 24 / 30 fps for standard capture; 60 / 120 fps for high-speed motion and AV perception |
| Codec | H.264 / H.265 for delivery; ProRes / DNxHR for edit-grade intra-frame; RAW where the pipeline requires it |
| Bit depth | 8-bit standard; 10-bit for HDR captures |
| Chroma subsampling | 4:2:0 for general capture; 4:2:2 / 4:4:4 for segmentation and edge-precise annotation |
| Multi-camera sync | SMPTE ST 2110 transport with PTP (Precision Time Protocol) sub-frame alignment |
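
Sub-frame alignment can be stated concretely: for one nominal frame, the spread of capture timestamps across cameras should be well under one frame period. The sketch below assumes timestamps in microseconds from PTP-disciplined clocks. The function names and the half-frame default tolerance are hypothetical choices for illustration.

```python
def frame_period_us(fps: float) -> float:
    """Duration of one frame in microseconds."""
    return 1_000_000.0 / fps


def max_skew_us(timestamps_us: list[int]) -> int:
    """Largest timestamp difference between any two cameras for one frame."""
    return max(timestamps_us) - min(timestamps_us)


def is_sub_frame_aligned(timestamps_us: list[int], fps: float,
                         tolerance_fraction: float = 0.5) -> bool:
    """True if the cross-camera skew is below a fraction of the frame period."""
    return max_skew_us(timestamps_us) < tolerance_fraction * frame_period_us(fps)
```

At 60 fps the frame period is about 16,667 µs, so a 200 µs skew across three cameras passes comfortably, while a 10,000 µs skew does not.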

Video capture spans a defined environmental matrix: lighting conditions (daylight, overcast, golden hour, twilight, low-light), weather (rain, snow, fog), and controlled motion and occlusion scenarios.

3.3 Image

Image data quality is bounded by sensor physics and is increasingly threatened by synthetic-data contamination. YPAI specifies optical floors and rejects generative output at intake, so the corpus reflects physical capture rather than diffusion-model artifacts.

Image capture specification
| Parameter | Specification |
| --- | --- |
| Resolution floor | 12 MP minimum for general object detection; 24 MP and above for high-density work such as aerial survey and medical imaging |
| Format | RAW (12 / 14 / 16-bit full sensor data); PNG for lossless delivery; JPEG only at quality 100 |
| Sensor consideration | Pixel pitch governs low-light SNR; full-frame sensors are specified where shadow detail matters |
| Diversity sampling | Combinatorial matrix of subject, angle, distance, and lighting |
| Synthetic-data rejection | Cryptographic and visual checks confirm physical-camera capture; diffusion-model output is rejected |
| Annotation readiness | High acutance without over-sharpening halos; lens distortion (barrel, pincushion) corrected before annotation; CVAT-ready for polygon, keypoint, and 3D bounding-box work |

3.4 Quality-control gates

Collection quality is verified, not assumed. Every modality runs the same two-layer quality-control pipeline before a delivery is signed.

The GDPR metadata-scrub principle (PII and geolocation stripped at intake) is a stated YPAI practice.

3.5 Provenance and lineage

Each payload receives a microsecond-precision timestamp at capture, a contributor identifier hashed with SHA-256, a device identifier, a session identifier, and a consent record reference. Lineage events are appended to an immutable audit log. The lineage manifest is emitted at delivery time as a JSON artifact.

lineage-manifest.json json
{
  "_comment": "[SAMPLE] structure only; production manifests are issued per delivery",
  "manifest_id": "ypai-mf-2026-Q2-c8d7",
  "delivery_id": "ypai-2026-Q2-AB12",
  "modality": "audio_speech",
  "captured_window": {
    "start": "2026-04-01T08:14:22.184739Z",
    "end": "2026-04-12T17:48:09.502611Z"
  },
  "contributors": {
    "count": 142,
    "identifier_hash_alg": "sha256",
    "country_distribution": { "NO": 38, "SE": 22, "DK": 19, "FI": 17, "DE": 26, "FR": 20 }
  },
  "device_fleet": {
    "audio_interface": "Focusrite Scarlett 2i2",
    "microphone": "sE2200"
  },
  "consent_records": {
    "scheme": "GDPR Art. 7",
    "record_count": 142,
    "withdrawal_endpoint": "https://contributor-portal.ypai.ai/api/v1/withdraw"
  },
  "lineage_log_root_hash": "sha256:7f2a...e93c",
  "manifest_signature_alg": "ed25519",
  "manifest_signature": "[SAMPLE]"
}
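
The two hashing steps described in this section can be sketched as follows: a SHA-256 contributor identifier and a single root hash over an append-only event log. How YPAI actually derives `lineage_log_root_hash` is not stated in this brief. The hash chain, the event fields, and the function names here are assumptions for illustration only.

```python
import hashlib
import json


def hash_contributor_id(contributor_id: str) -> str:
    """SHA-256 of the raw identifier, in the 'sha256:<hex>' form used above."""
    digest = hashlib.sha256(contributor_id.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def chain_root(events: list[dict], seed: str = "") -> str:
    """Fold events into one hash; each link covers the previous link plus the event.

    Changing, removing, or reordering any event changes the root, which is
    the property an immutable audit log needs from its root hash.
    """
    link = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    for event in events:
        # Canonical JSON so that key order does not affect the hash.
        canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
        link = hashlib.sha256((link + canonical).encode("utf-8")).hexdigest()
    return f"sha256:{link}"
```

A plain hash of a low-entropy identifier can be reversed by enumeration, so a production scheme would likely add a secret key or salt; that detail is outside what this brief documents.
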
Section 4

Identity, Access, and Annotation Infrastructure

YPAI operates three annotation environments. CVAT (self-hosted) handles bounding box, semantic and instance segmentation, keypoint, and polygon workflows on image and video. Label Studio (self-hosted) handles timeline and frame-by-frame video, audio waveform tasks, and text annotation. Proprietary tooling handles workflows that exceed the capabilities of the open-source tools: multi-track speaker diarization with paired-dialogue context, LiDAR plus camera fusion review, and per-contributor consent workflow integration.

Each customer engagement provisions a dedicated annotation VPC. The VPC has no shared database boundary with any other engagement. Reviewer workstations connect through a hardened jump host. Data replication from the data plane to the annotation VPC is one-way and authenticated. No annotation workload egresses to the public internet.

Reviewers authenticate against the corporate identity provider via SAML 2.0 or OIDC. MFA is enforced via WebAuthn or TOTP. Workstation sessions inherit short-lived tokens, time-bound to reviewer shifts. RBAC binds five role classes (viewer, annotator, senior annotator, qa_reviewer, engagement_lead) to a permission matrix scoped to the engagement VPC. Cross-engagement access is structurally impossible because IAM policy is scoped at the VPC boundary.
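
A minimal sketch of an engagement-scoped permission check. The five role names come from the paragraph above, while the permission names, the `Grant` type, and the matrix contents are hypothetical. The point it illustrates is that a grant bound to one engagement can never authorize an action in another, and that unknown roles are denied by default.

```python
from dataclasses import dataclass

# Role names are from the text; the permission sets are illustrative only.
PERMISSIONS = {
    "viewer": {"read"},
    "annotator": {"read", "annotate"},
    "senior annotator": {"read", "annotate", "resolve_disagreement"},
    "qa_reviewer": {"read", "review"},
    "engagement_lead": {"read", "annotate", "resolve_disagreement",
                        "review", "assign"},
}


@dataclass(frozen=True)
class Grant:
    """A role held by one user inside exactly one engagement."""
    user: str
    role: str
    engagement_id: str


def is_allowed(grant: Grant, action: str, engagement_id: str) -> bool:
    """Default-deny check: wrong engagement or unknown role means no access."""
    if grant.engagement_id != engagement_id:
        return False
    return action in PERMISSIONS.get(grant.role, set())
```

For example, `is_allowed(Grant("alice", "annotator", "eng-1"), "annotate", "eng-2")` is false even though the role itself permits annotation.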

Section 5

Quality Assurance and Evaluation

Production annotation runs a minimum of two independent passes per task. The system compares outputs using a task-appropriate consensus algorithm. Bounding-box and segmentation tasks compare via Intersection over Union (IoU). Sequence-labeling tasks compare via token-level F1. Audio diarization tasks compare via Diarization Error Rate (DER) on the consensus segments. Disagreements above a configured threshold escalate to a senior reviewer.
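
Two of the comparison metrics named above can be sketched briefly, together with the escalation rule. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples and label sequences are assumed to be token-aligned. The `"O"` background label and the threshold value are assumptions, since the brief says thresholds are configured per engagement.

```python
from typing import Sequence

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def token_f1(reference: Sequence[str], candidate: Sequence[str],
             background: str = "O") -> float:
    """Micro-averaged token-level F1, ignoring the background label."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must be token-aligned")
    pairs = list(zip(reference, candidate))
    tp = sum(1 for r, c in pairs if r == c and r != background)
    fp = sum(1 for r, c in pairs if c != background and r != c)
    fn = sum(1 for r, c in pairs if r != background and r != c)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def needs_escalation(agreement: float, max_disagreement: float) -> bool:
    """Escalate when disagreement (1 - agreement) exceeds the threshold."""
    return (1.0 - agreement) > max_disagreement
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7; with a disagreement threshold of 0.5 that pair would escalate to a senior reviewer.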

Each engagement starts with a gold-set definition: a small set of representative tasks with reviewer-validated correct outputs. Reviewers must pass calibration on the gold set before entering the production queue. The gold set is versioned. Production reviewers periodically re-run gold tasks blind; failed re-runs gate the reviewer out of the queue pending re-calibration. Inter-annotator agreement is tracked over rolling windows. The IAA history ships in the per-engagement evidence package.
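
The brief does not name the agreement statistic it tracks. As one common choice, the sketch below computes Cohen's kappa for two annotators and a simple gold-set accuracy gate. The function names and the 0.9 pass mark are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same tasks."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equally long label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labelled at random with
    # their own observed label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def passes_gold(reviewer_labels: Sequence[str], gold_labels: Sequence[str],
                min_accuracy: float = 0.9) -> bool:
    """Gate a reviewer on accuracy against the versioned gold set."""
    correct = sum(r == g for r, g in zip(reviewer_labels, gold_labels))
    return correct / len(gold_labels) >= min_accuracy
```

Tracking agreement over rolling windows then amounts to calling `cohen_kappa` on the most recent N shared tasks each time the window advances.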

Section 6

Data Delivery and Retention

Three transport options are available by default: a customer-elected SFTP endpoint with customer-managed credentials; customer-managed S3-compatible object storage with customer-provided KMS keys; and physical secure media for very large or air-gap-required deliveries. Customer-managed keys are supported in all software paths. Private network options (AWS PrivateLink, Azure ExpressRoute) can be configured during engagement scoping.

Output formats include JSON-LD with manifest.sha256 (default), COCO, Pascal VOC, TFRecord, Parquet, WebDataset, KITTI, and customer-negotiated custom formats.
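
A receiving team would typically re-hash delivered files against `manifest.sha256` before ingesting them. The sketch below assumes the manifest uses the common `sha256sum` text layout (`<hex digest>  <relative path>` per line); the brief does not specify the layout, so treat that format and the function names as assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large deliveries fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    root = manifest_path.parent
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition("  ")
        target = root / name
        if not target.is_file() or sha256_file(target) != expected:
            failures.append(name)
    return failures
```

An empty return value means every listed file matched; any other result names the files to re-request.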

verify-delivery.sh bash
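# [SAMPLE] Fetch a YPAI delivery manifest and print its signature and lineage root hash.
# Checking the signature itself requires the signing public key and is not shown here.
set -euo pipefail
: "${ENGAGEMENT_TOKEN:?must be set}" "${DELIVERY_ID:?must be set}"
curl -sS --fail \
  -H "Authorization: Bearer ${ENGAGEMENT_TOKEN}" \
  -H "Accept: application/json" \
  "https://delivery.ypai.ai/v1/manifests/${DELIVERY_ID}" \
  | jq '.manifest_signature, .lineage_log_root_hash'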
delivery-manifest.json json
{
  "_comment": "[SAMPLE]",
  "delivery_id": "ypai-2026-Q2-AB12",
  "modality": "audio_speech",
  "jurisdiction": "EEA / Norway",
  "consent": { "record_set_id": "ypai-cs-2026-Q2-AB12", "scheme": "GDPR Art. 7", "record_count": 142 },
  "lineage": { "contributors": 142, "devices": 12, "log_root_hash": "sha256:7f2a...e93c" },
  "qa": { "ai_act_article": "10", "sample_rate": 0.05, "reviewers": 9, "passes_per_task": 2 },
  "dpa": { "status": "executed", "version": "ypai-dpa-2026-Q1" },
  "delivery_format": "JSON-LD + manifest.sha256",
  "transport": "SFTP",
  "encryption_at_rest": "AES-256-GCM",
  "manifest_signature_alg": "ed25519",
  "manifest_signature": "[SAMPLE]"
}

Default retention windows: source captures retained for the engagement contractual window; annotated outputs retained for the contractual audit window; QA evidence retained for the contractual audit window. After contractual expiry, automated destruction executes against the data-plane storage and replicated annotation copies; destruction is logged and the log is delivered to the customer on request. See DPA terms.

Section 7

Regulatory Compliance and Auditability

The evidence package is the operative artifact. A customer conducting its own conformity assessment under EU AI Act Article 10 receives every artifact the assessment requires, framework by framework, with each control mapped below.

7.1 GDPR

YPAI operates under an EU/EEA corporate structure, with EEA-resident data infrastructure as a baseline. Specific GDPR provisions map to specific operational controls.

GDPR control mapping
| Regulation | YPAI control | Evidence artifact |
| --- | --- | --- |
| Article 6(1)(a) Consent as lawful basis | Per-contributor signed consent before any capture | Consent record JSON-LD |
| Article 7 Conditions for consent | Specific, informed, unambiguous, withdrawable consent; consent_language_version captured | Consent record + language version registry |
| Article 9 Special category data | Explicit opt-in scope (biometric voice, medical imaging) | Consent record `scope` field |
| Article 17 Right to erasure | 30-day erasure workflow cascading to replicas | Erasure log entry per request |
| Article 28 Processor obligations | YPAI-issued DPA executed before any data flow | Signed DPA artifact |
| Article 48 Third-country transfer | EEA-resident storage as default; SCCs available on customer-elected transfer | SCCs annex; data residency manifest field |

consent-record.json json
{
  "_comment": "[SAMPLE]",
  "consent_record_id": "ypai-cr-2026-04-87234",
  "contributor_id_hash": "sha256:a3f9...c4d2",
  "modality": "audio_speech",
  "signed_at": "2026-04-12T09:34:18Z",
  "legal_basis": "GDPR Article 7",
  "consent_language_version": "en-1.4",
  "scope": ["asr_training", "tts_training", "evaluation"],
  "withdrawal_endpoint": "https://contributor-portal.ypai.ai/api/v1/withdraw",
  "ip_address_hash": "sha256:b8e1...f7a9",
  "agreement_hash": "sha256:c9d2...e5b8",
  "ttl_days": null
}
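
Two of the controls in this section reduce to small checks on the consent record: whether a proposed use falls inside the recorded `scope`, and when the 30-day erasure workflow from the Article 17 row must complete. The sketch below is an illustration under those assumptions; the function names are hypothetical and the real workflow is not documented in this brief.

```python
from datetime import datetime, timedelta, timezone

ERASURE_WINDOW = timedelta(days=30)  # from the Article 17 row of the mapping


def is_use_permitted(record: dict, purpose: str) -> bool:
    """True only if the purpose was explicitly opted into on the record."""
    return purpose in record.get("scope", [])


def erasure_deadline(requested_at: datetime) -> datetime:
    """Latest completion time for an erasure request received at requested_at."""
    return requested_at + ERASURE_WINDOW


record = {"scope": ["asr_training", "tts_training", "evaluation"]}
request_time = datetime(2026, 4, 12, 9, 34, 18, tzinfo=timezone.utc)
print(is_use_permitted(record, "tts_training"))      # in scope
print(is_use_permitted(record, "voice_cloning"))     # not in scope
print(erasure_deadline(request_time).isoformat())
```

Treating a missing `scope` field as an empty list keeps the check default-deny: a record that does not list a purpose never authorizes it.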

7.2 EU AI Act Article 10

Article 10 governs training, validation, and testing data for high-risk AI systems. The YPAI delivery package supports the customer conformity assessment under each sub-clause. Effective date: 2 August 2026.

EU AI Act Article 10 mapping
| Regulation | YPAI control | Evidence artifact |
| --- | --- | --- |
| Article 10(1) Quality criteria | Multi-pass annotation with IoU, DER, F1 thresholds per task type | QA result JSON; threshold table per engagement |
| Article 10(2) Data governance practices | Provenance log per sample; annotation guideline versioning; bias detection workflow | Lineage manifest; guideline registry version |
| Article 10(3) Representativeness | Contributor demographic distribution recorded per delivery | Manifest `contributors.country_distribution` block |
| Article 10(3) Error validation | Gold-set governance and inter-annotator agreement tracking | IAA report; gold-set version |
| Article 10(5) Special-category processing | Explicit opt-in scope field on contributor consent record | Consent record `scope` field |
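
For the representativeness row, a customer can derive per-country shares directly from the manifest's `contributors` block. The sketch below uses the figures from the sample lineage manifest in Section 3.5; the function names and the minimum-share threshold are hypothetical, and any real threshold would be set by the customer's own assessment.

```python
def country_shares(manifest: dict) -> dict[str, float]:
    """Fraction of contributors per country, after a consistency check."""
    contributors = manifest["contributors"]
    distribution = contributors["country_distribution"]
    total = sum(distribution.values())
    if total != contributors["count"]:
        raise ValueError("country_distribution does not sum to count")
    return {country: n / total for country, n in distribution.items()}


def underrepresented(shares: dict[str, float], min_share: float) -> list[str]:
    """Countries whose share falls below the chosen minimum."""
    return sorted(c for c, share in shares.items() if share < min_share)


sample = {
    "contributors": {
        "count": 142,
        "country_distribution": {"NO": 38, "SE": 22, "DK": 19,
                                 "FI": 17, "DE": 26, "FR": 20},
    }
}
shares = country_shares(sample)
print(underrepresented(shares, min_share=0.13))
```

In the sample manifest the six country counts sum to the stated 142 contributors, so the consistency check passes.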

7.3 DORA

For financial-sector clients, DORA mandates ICT risk management and third-party oversight. The single-tenant VPC architecture addresses operational-resilience testing without multi-tenant cascading-failure exposure. Specific RTO and RPO targets are documented per engagement.

DORA control mapping
| Regulation | YPAI control | Evidence artifact |
| --- | --- | --- |
| ICT risk management (Art. 5 to 15) | Documented architecture, single-tenant isolation per engagement | Architecture document; access control matrix |
| Third-party ICT risk (Art. 28) | DPA, processor scope, sub-processor disclosure | DPA + sub-processor list |
| Incident reporting (Art. 17 to 23) | Documented incident response, log retention | Incident response runbook |
| Operational resilience testing (Art. 24 to 27) | Single-tenant blast radius; isolated recovery procedures | Recovery procedure document |

7.4 TISAX AL3

TISAX Assessment Level 3 covers data with a very high need for protection (automotive prototype data, intellectual property). YPAI controls align with the AL3 baseline. YPAI is not currently TISAX-labeled; alignment is the operational posture, based on self-attestation rather than third-party assessment.

TISAX AL3 control mapping
| Regulation | YPAI control | Evidence artifact |
| --- | --- | --- |
| Physical security | Documented facility access controls; reviewer workstation hardening | Facility statement; workstation policy |
| Logical access | RBAC, MFA via WebAuthn or TOTP, jump host for VPC entry | IAM matrix; auth log |
| Encryption (transit and at rest) | TLS 1.3 + AES-256-GCM | Architecture statement |
| Air-gap option | Provisioned per engagement on request; no public-internet path | Engagement scope document |
| Prototype data segregation | Single-tenant VPC per engagement | Network topology diagram |
| TISAX label status | YPAI is NOT currently TISAX-labeled. Controls align with AL3 baseline. | Self-attestation only; not third-party assessed |

7.5 US CLOUD Act counter-position

The US CLOUD Act allows US law enforcement to compel US-headquartered providers to disclose data regardless of where it is physically hosted. This conflicts with GDPR Article 48. YPAI's Norwegian corporate structure and EEA infrastructure operate outside US jurisdiction. The brief states this as architectural fact, not a comparison.

Section 8

Appendix

8.1 Glossary

AES-256-GCM (encryption at rest), CMK (Customer-Managed Key), CVAT (Computer Vision Annotation Tool), DER (Diarization Error Rate), DORA (Digital Operational Resilience Act), DPA (Data Processing Agreement), ed25519 (signature scheme), GDPR (General Data Protection Regulation), IAA (Inter-Annotator Agreement), IAM (Identity and Access Management), IoU (Intersection over Union), JSON-LD (JSON Linked Data), MTPE (Machine Translation Post-Editing), OIDC (OpenID Connect), RBAC (Role-Based Access Control), SAML 2.0 (Security Assertion Markup Language), SCCs (Standard Contractual Clauses), SFTP (SSH File Transfer Protocol), TISAX (Trusted Information Security Assessment Exchange), TLS 1.3 (Transport Layer Security), VPC (Virtual Private Cloud), WebAuthn (Web Authentication).

8.2 Version history

Version history
| Version | Date | Changes |
| --- | --- | --- |
| v1.0 | 2026-05-14 | Initial publication. |
