Key Takeaways
- **Treat Article 10 as a technical specification.** Producing a data governance document does not satisfy the requirement. You must demonstrate to a notified body that your training data met specific standards before the model was trained.
- **Link consent to individual records.** A standalone privacy policy is insufficient. Every data point requires a traceable consent reference to survive a conformity assessment.
- **Automate pre-training bias assessments.** Article 10(2)(f) mandates evaluating training data for discriminatory patterns. Build demographic distribution reporting directly into your annotation pipeline, with timestamps that predate each training run.
- **Version-control all preprocessing operations.** Log normalization, filtering, and augmentation steps at the pipeline level. Auditors require git commit hashes and parameter logs, not verbal descriptions.
- **Deploy compliance-grade infrastructure.** Retrofitting legacy pipelines for auditability is a massive engineering burden. Source training data from providers whose provenance records and annotation logs satisfy Article 10 natively.
Most Companies Will Fail Their First Article 10 Audit — Here’s Why
Your ASR model achieves a 12.6% Word Error Rate (WER) in winter conditions. Your inference latency sits comfortably under 200ms. Your MLOps pipeline is reproducible and monitored. None of this matters to a notified body reviewing your EU AI Act conformity assessment. They are not auditing your model’s performance. They are auditing your training data’s provenance.
That is the disconnect most engineering teams discover too late.
EU AI Act Regulation 2024/1689 Article 10 does not care if your AI works well. It demands proof—via documented technical artifacts—that the data used to train your high-risk AI system met strict governance standards before training began. If you cannot produce that machine-readable evidence, the model cannot legally ship as a high-risk AI system in the EU. Full stop.
This Is an Engineering Problem, Not a Legal One
Article 10 is frequently handed to legal or compliance teams, who produce what looks like compliance: a data governance policy document, a privacy impact assessment, and a signed vendor agreement. These artifacts satisfy nothing under Article 10.
What Article 10 actually requires is a set of auditable technical records:
- documented data collection procedures that are reproducible,
- logged preprocessing operations covering normalization, filtering, and augmentation,
- explicit statements of the assumptions made about what the training data represents, and
- bias examination records demonstrating that datasets were evaluated for characteristics likely to affect health and safety or lead to prohibited discrimination.
These are engineering deliverables. They must exist before the model is trained.
The Stakes Are Not Abstract
Under EU AI Act Article 99, violations of Article 10’s data governance requirements carry fines of up to €15 million or 3% of global annual turnover, whichever is higher. For a Fortune 500 company with $50B in annual revenue, that is a $1.5B liability.
Article 43 establishes the conformity assessment process that high-risk AI systems must pass before EU market access is granted. A notified body conducting that assessment will request your data governance documentation directly. A PDF policy and a checkbox do not constitute documentation. Reproducible data collection procedures, preprocessing logs, and bias examination records do. Most teams are building excellent models on a foundation that cannot survive this audit.
What EU AI Act Article 10 Actually Requires Engineers to Build
Article 10 is a technical specification for a data governance system. It must exist before training begins, persist for a decade after the model ships, and be producible on demand for a notified body. Reading it as a set of engineering deliverables is the only framing that produces artifacts capable of surviving an audit.
Here is what Articles 10(2) through 10(5) require in concrete terms.
Article 10(2) mandates documented data governance practices: the design choices behind data source selection, reproducible data collection procedures, logged preprocessing operations, and explicit statements of the assumptions embedded in the data—what population it represents, under what conditions it was collected, and what it was never intended to represent.
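As a concrete illustration, here is a minimal sketch of one machine-readable registry entry. The `DataSourceRecord` schema and all field names are illustrative assumptions, not terms the regulation mandates; the point is that each Article 10(2) element becomes a queryable field rather than a paragraph in a policy PDF.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DataSourceRecord:
    """One entry in a hypothetical Article 10(2) data source registry."""
    source_id: str
    origin: str                 # where the data came from
    collection_method: str      # how it was gathered, reproducibly
    consent_framework_ref: str  # pointer to the GDPR lawful-basis record
    collected_under: str        # conditions of collection
    represents: str             # population the data is assumed to represent
    not_intended_for: list[str] = field(default_factory=list)  # explicit non-goals

record = DataSourceRecord(
    source_id="speech-no-2025-001",
    origin="contracted field collection, Norway",
    collection_method="scripted prompts, 48 kHz studio microphones",
    consent_framework_ref="consent/contributor-agreements-v3",
    collected_under="indoor, low-noise recording conditions",
    represents="adult Norwegian speakers, Bokmål and Nynorsk",
    not_intended_for=["children's speech", "far-field audio"],
)

# Serialized JSON is diffable, versionable, and producible on demand.
print(json.dumps(asdict(record), indent=2))
```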
Article 10(2)(f) and (g) require that training, validation, and test datasets be examined for biases likely to affect health and safety or lead to prohibited discrimination, and that measures be taken to detect and mitigate any biases found. This requires documented representativeness assessments covering geographic, contextual, and demographic coverage. Articles 10(3) and 10(4) add requirements for representativeness, error freedom, and completeness: documented thresholds with a stated rationale for what level of error or incompleteness was deemed acceptable and why.
Article 10(5) introduces a narrow exception permitting the processing of sensitive data categories—including special categories under GDPR Article 9—when necessary to detect and correct bias in high-risk AI systems. This requires explicit purpose limitation, additional technical and organizational safeguards, and documented deletion protocols once the bias examination is complete. Teams treating Article 10(5) as a general license to include sensitive data in training sets will fail the conformity assessment and expose the organization to compounding GDPR liability.
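One way to make that purpose limitation and deletion protocol concrete is a scoped accessor that logs why special-category columns were loaded and removes them when the examination ends. This is a minimal in-memory sketch with invented field names; a real deletion protocol must also cover persistent storage and backups.

```python
import logging
from contextlib import contextmanager
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("article-10-5")

@contextmanager
def sensitive_view(records, columns, purpose):
    """Expose special-category columns only for a stated purpose, log the
    access, and delete the columns when the examination block exits."""
    log.info("Loading %s at %s for purpose: %s", columns,
             datetime.now(timezone.utc).isoformat(), purpose)
    try:
        yield [{c: r[c] for c in columns} for r in records]
    finally:
        for r in records:
            for c in columns:
                r.pop(c, None)  # documented deletion after the bias check
        log.info("Deleted %s after bias examination", columns)

records = [{"id": 1, "ethnicity": "A", "text": "..."},
           {"id": 2, "ethnicity": "B", "text": "..."}]

with sensitive_view(records, ["ethnicity"], "Article 10(5) bias detection") as view:
    counts = {}
    for row in view:
        counts[row["ethnicity"]] = counts.get(row["ethnicity"], 0) + 1
    log.info("Observed distribution: %s", counts)

assert "ethnicity" not in records[0]  # sensitive column is gone post-check
```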
Data Governance as Code: The Six Artifacts You Need
Each Article 10 requirement maps to a concrete artifact. These six form the minimum viable data governance record for a high-risk AI system:
- Data source registry with provenance metadata — origin, collection method, consent framework reference, and chain of custody for every dataset used in training, validation, and testing.
- Preprocessing operation log with version control — a reproducible, timestamped record of every transformation applied to the data, including the software version and parameters used.
- Feature selection rationale document — the documented reasoning for which inputs were included, which were excluded, and why, including any proxy variables that could introduce prohibited discrimination.
- Bias examination report per training dataset — a structured evaluation of each dataset against the demographic, geographic, and contextual dimensions relevant to the model’s intended use case, with findings and remediation steps recorded.
- Representativeness gap analysis — a documented comparison between the population the training data represents and the population the deployed model will encounter, including known gaps and their expected impact on model accuracy.
- Error-rate measurement methodology and results — the testing protocol, acceptable error thresholds, and measured results for the training, validation, and test splits, with the rationale for why the thresholds were set where they were.
Each of these artifacts must be machine-readable and auditable. A Word document in a shared drive fails the reproducibility requirement under Article 11, which references Article 10 data governance records as components of the mandatory technical documentation package. Engineering teams must produce these artifacts as part of a standard ML workflow.
The 10-Year Documentation Clock
Article 18 of the EU AI Act requires providers to retain technical documentation—including all Article 10 data governance records—for 10 years after an AI system is placed on the market or put into service.
If your team trains a model in 2026 and ships it in 2027, a notified body or market surveillance authority can request the complete data governance record in 2037. Cloud storage buckets with no lifecycle governance, annotation platform exports saved to a shared drive, and preprocessing scripts that exist only in a departed engineer’s local environment are liability exposures with a 10-year fuse. You need a governed artifact store: versioned, access-controlled, with retention policies explicitly set to satisfy Article 18.
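For teams on AWS, one hedged option for that retention layer is S3 Object Lock in compliance mode, which blocks deletion and overwrite until the retain-until date for every principal, including administrators. The bucket and key below are hypothetical, and Object Lock must have been enabled when the bucket was created.

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

# Ten years plus padding for leap days, so retention never undershoots.
retain_until = datetime.now(timezone.utc) + timedelta(days=3653)

with open("bias_examination_report.json", "rb") as f:
    s3.put_object(
        Bucket="example-governance-artifacts",  # hypothetical bucket
        Key="models/asr-v4/bias_examination_report.json",
        Body=f,
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=retain_until,
    )
```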
Three Failure Modes That Compliance Theater Misses
Most high-risk AI teams believe they are compliant. That false confidence is the primary risk. The three failure modes below result from building a compliance strategy around documentation optics rather than engineering reality. Each one will fail a conformity assessment under EU AI Act Article 43.
Failure Mode 1: The Post-Hoc Documentation Trap
A team builds a model using defensible ML practices—proper train/validation/test splits, preprocessing scripts under version control, thoughtful feature selection—but none of it is documented in an auditable format at the time it happens. Six months later, engineers reconstruct the process from memory, Slack threads, and notebook outputs.
Retroactive reconstruction is a narrative, not a documentation artifact.
A notified body conducting a conformity assessment under Article 43 will ask: “Show me the preprocessing log from the date this training run was executed—the software version, the parameters, and the input dataset hash.” If that record was written six months after the fact, it fails the reproducibility standard. Preprocessing logs must be generated by the pipeline natively. Data provenance records must be written at ingestion.
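A minimal sketch of what pipeline-native logging can look like, assuming the pipeline runs inside a git repository. The function name, file paths, and step parameters are illustrative, not a standard API; what matters is that the hash, code version, and timestamp are captured at execution time, not reconstructed later.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Content hash of the input dataset, recorded before the transform runs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def log_preprocessing_step(log_path, step, params, input_path):
    """Append one log entry, generated by the pipeline at execution time."""
    entry = {
        "step": step,
        "params": params,
        "input_sha256": sha256_of(input_path),
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")  # append-only JSONL

log_preprocessing_step(
    "preprocessing_log.jsonl",
    step="loudness_normalization",
    params={"target_lufs": -23.0},
    input_path="corpus/train_batch_017.tar",
)
```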
Failure Mode 2: Bias Assessment at the Wrong Stage
Article 10(2)(f) of the EU AI Act requires that training datasets be examined for biases before the model is trained.
Most MLOps pipelines have no pre-training bias evaluation step. Teams run fairness metrics on model predictions. That is model fairness testing. It is not what Article 10(2)(f) requires. A compliant pre-training bias examination pipeline includes demographic distribution analysis of the training corpus, geographic coverage mapping against the intended deployment population, and edge-case gap identification—all documented before the training job starts. A fairness evaluation conducted on the deployed model will not pass scrutiny.
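A minimal sketch of a fail-closed pre-training examination stage. The target distribution, tolerance, and attribute name are placeholders for your own representativeness specification; the point is that the report is generated, timestamped, and enforced before any training job starts.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Hypothetical targets from the Phase 1 representativeness specification.
TARGETS = {"female": 0.50, "male": 0.50}
TOLERANCE = 0.05  # assumed acceptable deviation; document your own rationale

def examine_before_training(records, attribute, report_path):
    """Write a timestamped distribution report and block training on gaps."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    observed = {k: v / total for k, v in counts.items()}
    gaps = {k: round(observed.get(k, 0.0) - t, 4)
            for k, t in TARGETS.items()
            if abs(observed.get(k, 0.0) - t) > TOLERANCE}
    report = {
        "attribute": attribute,
        "observed": observed,
        "targets": TARGETS,
        "gaps_beyond_tolerance": gaps,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)  # timestamp predates the training run
    if gaps:
        raise RuntimeError(f"Remediate before training: {gaps}")

examine_before_training(
    [{"speaker_gender": "female"}, {"speaker_gender": "male"}],
    attribute="speaker_gender",
    report_path="bias_examination_report.json",
)
```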
Failure Mode 3: The GDPR–Article 10 Intersection
Training data compliance consists of two simultaneous obligations. GDPR Article 7 requires a documented lawful basis for processing personal data. EU AI Act Article 10 requires data governance records covering provenance, collection procedures, and bias examination. Neither satisfies the other.
If you cannot demonstrate a lawful basis for every data point in your training set—including a complete consent framework with records of processing activities under GDPR Article 30—the dataset is a liability regardless of how thorough your Article 10 documentation is. A notified body will ask for both the GDPR legal basis documentation and the Article 10 data governance record as separate, independently verifiable artifacts.
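A sketch of the gate this implies, assuming a hypothetical consent store keyed by consent reference. The check refuses to clear any dataset containing records that cannot be traced to a consent record.

```python
def verify_consent_linkage(training_records, consent_index):
    """Require a resolvable consent link (GDPR Art. 7) for every record
    before the dataset enters an Article 10 governance package."""
    orphans = [r["record_id"] for r in training_records
               if r.get("consent_ref") not in consent_index]
    if orphans:
        raise ValueError(
            f"{len(orphans)} records lack a consent link, e.g. {orphans[:5]}")
    return True

# Hypothetical consent references loaded from the consent store.
consent_index = {"consent-0001", "consent-0002"}
training_records = [
    {"record_id": "utt-001", "consent_ref": "consent-0001"},
    {"record_id": "utt-002", "consent_ref": "consent-0002"},
]
verify_consent_linkage(training_records, consent_index)
```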
An Engineering Checklist for Article 10 Data Governance
Compliance theater fails because it relies on undated documentation and post-hoc reports. The following checklist operationalizes Article 10 as an engineering workflow. This checklist applies equally to speech, text, image, video, and LiDAR datasets. An automotive LiDAR training corpus carries the exact same pre-training examination requirements as a medical transcription dataset.
Phase 1: Before You Collect a Single Data Point
Responsible AI starts at collection design. By the time data enters your pipeline, the decisions that determine Article 10(2)(a)–(e) compliance have already been made.
1. **High-risk AI classification assessment.** Determine whether your intended use case falls under Annex III of the EU AI Act. Document the classification decision with legal sign-off. Artifact: classification memo stored in your compliance document repository with a dated signature.
2. **Data source registry.** Create a registry of every planned data source. For each source, record origin, access method, and the legal basis for use. Artifact: versioned data source registry in your data catalog, linked to your GDPR Article 30 records of processing activities.
3. **Consent framework per source.** For any source containing personal data, document the lawful basis under GDPR Article 7 (or Article 9 for special-category data). Obtain your data provider’s consent framework documentation as a separate artifact. Artifact: per-source consent records stored alongside the data source registry, independently retrievable.
4. **Representativeness targets.** Define the intended deployment population. Document geographic coverage, demographic distribution targets, and language or dialect requirements before collection begins. Artifact: representativeness specification document, timestamped before the collection start date (a minimal sketch follows this list).
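A hedged sketch of item 4 as a machine-readable artifact. The population, coverage splits, and locale codes are invented for illustration; the essential properties are that the file is versioned and timestamped before collection starts.

```python
import json
from datetime import datetime, timezone

# Hypothetical representativeness specification, written before collection.
spec = {
    "intended_deployment_population": "adult drivers, EU, in-cabin speech",
    "geographic_coverage": {"NO": 0.25, "SE": 0.25, "DE": 0.25, "FR": 0.25},
    "demographic_targets": {"female": 0.50, "male": 0.50},
    "language_requirements": ["nb-NO", "sv-SE", "de-DE", "fr-FR"],
    "specified_at": datetime.now(timezone.utc).isoformat(),
}

with open("representativeness_spec_v1.json", "w") as f:
    json.dump(spec, f, indent=2)
```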
Phase 2: Before You Start a Training Run
Article 10(2)(f) requires bias examination of training datasets before training. The timestamp on your bias report must predate your training job.
5. **Preprocessing operation log.** Every normalization, augmentation, filtering, and sampling operation applied to the dataset must be logged with the version of the script or tool that performed it. Artifact: versioned preprocessing log generated automatically by the pipeline and stored in your experiment tracking system.
6. **Bias examination report.** Run demographic distribution analysis, geographic coverage mapping against your representativeness specification, and edge-case gap analysis. Document findings and remediation steps. Artifact: bias examination report with a timestamp predating the training job start time.
7. **Annotation provenance metadata.** Your annotation pipeline must produce per-annotation provenance records: annotator identifier, timestamp, annotation tool version, and inter-annotator agreement scores (a record sketch follows this list). Artifact: provenance metadata file per annotation batch, linked to the dataset version in your data catalog.
8. **Data quality validation results.** Define error-rate thresholds before validation runs. Document the threshold, the measured result, and the disposition decision. Artifact: quality validation report with documented thresholds and outcomes.
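The record sketch promised in item 7. Every value here is hypothetical, including the tool name and the agreement metric; it is shown only to make the shape of a per-annotation provenance record concrete.

```python
import json
from datetime import datetime, timezone

# One JSONL line per label, emitted natively by the annotation pipeline.
annotation_record = {
    "annotation_id": "ann-000871",
    "dataset_version": "speech-no-2025-001@v4",
    "annotator_id": "annotator-17",          # pseudonymous identifier
    "tool_version": "annotation-tool-1.12",  # version of the labelling tool
    "labelled_at": datetime.now(timezone.utc).isoformat(),
    "inter_annotator_agreement": 0.91,       # e.g. batch-level Krippendorff's alpha
}

with open("annotations_batch_042.provenance.jsonl", "a") as f:
    f.write(json.dumps(annotation_record) + "\n")
```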
Phase 3: After Training, Before Market Placement
9. **Technical documentation package (Annex IV).** Annex IV of the EU AI Act specifies the technical documentation required for high-risk AI systems. Assemble the complete package—data source registry, consent records, preprocessing logs, bias examination report, annotation provenance metadata, quality validation results—as a unified, cross-referenced artifact set (a manifest sketch follows the note below).
10. **Retention infrastructure.** Establish immutable storage with access controls and a documented retrieval procedure to satisfy the 10-year retention requirement under Article 18.
11. **Internal audit simulation.** Assign a team member to request each artifact cold and verify it can be located, retrieved, and understood independently. Gaps found internally are fixable. Gaps found by a notified body are not.
A note on data governance certificates from providers: A data governance certificate issued by your training data provider is valid supporting evidence. YPAI’s annotation pipeline generates provenance metadata and bias examination documentation as native pipeline outputs, mapping directly to items 7 and 8 above. This documentation supports your compliance package, but it does not replace your obligation as the AI system provider to assemble and maintain the complete Annex IV technical documentation.
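The manifest sketch referenced in item 9: one way to cross-reference the artifact set by content hash so that completeness and integrity are mechanically checkable at audit time. The filenames are the hypothetical ones used in the sketches above.

```python
import hashlib
import json
from datetime import datetime, timezone

ARTIFACTS = [
    "data_source_registry.json",
    "consent_records.json",
    "preprocessing_log.jsonl",
    "bias_examination_report.json",
    "annotations_batch_042.provenance.jsonl",
    "quality_validation_report.json",
]

def build_manifest(paths, out="annex_iv_manifest.json"):
    """Hash every artifact so the package can be verified as complete
    and untampered when a notified body requests it."""
    entries = []
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        entries.append({"artifact": p, "sha256": digest})
    manifest = {
        "assembled_at": datetime.now(timezone.utc).isoformat(),
        "artifacts": entries,
    }
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)

build_manifest(ARTIFACTS)
```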
How Production Data Infrastructure Closes the Article 10 Gap
Article 10 failures stem from infrastructure designed to produce models, not evidence. The audit trail, the provenance metadata, the bias examination records: none of these were requirements when most enterprise AI pipelines were originally architected.
Compliance-grade data infrastructure has five defining characteristics:
- Immutable audit logging — every data access, transformation, and versioning event is written to an append-only log with timestamps and actor identifiers (see the sketch after this list).
- Per-record provenance metadata — each data record carries a chain of custody: source, collection date, consent reference, preprocessing operations applied, and annotation identifiers.
- Consent chain tracking — consent records are linked to individual data records. When a data subject withdraws consent under GDPR Article 7, the affected records can be identified and removed without manual reconstruction.
- Automated bias reporting — demographic distribution and representativeness analysis runs as a pipeline stage. Reports are timestamped and versioned alongside the dataset.
- Version-controlled preprocessing pipelines — every preprocessing operation is reproducible from a pinned version of the pipeline code.
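The audit-logging sketch promised above. Hash-chaining is one common integrity technique, not the only one: each entry embeds the hash of the previous line, so any retroactive edit breaks the chain and is detectable. Actor and payload values are invented; a production system would put this behind a service backed by write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_audit_event(log_path, actor, action, payload):
    """Append one event whose entry embeds the hash of the previous line."""
    prev_hash = "genesis"
    try:
        with open(log_path, "rb") as f:
            prev_hash = hashlib.sha256(f.readlines()[-1]).hexdigest()
    except (FileNotFoundError, IndexError):
        pass  # first entry in a new log
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "payload": payload,
        "prev_sha256": prev_hash,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")

append_audit_event("audit.jsonl", actor="pipeline@ci",
                   action="dataset_version_created",
                   payload={"dataset": "speech-no-2025-001@v5"})
```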
GDPR Article 25—data protection by design and by default—requires that privacy safeguards be built into processing systems from the ground up. The same logic applies to Article 10 auditability: infrastructure that was not designed for compliance cannot be made compliant through documentation alone.
YPAI’s speech data collection and annotation operations are built around this model. Consent frameworks are documented per contributor and linked to individual recordings. Annotation pipelines produce per-annotation provenance records—annotator identifier, timestamp, tool version, inter-annotator agreement scores—as native outputs. Multilingual coverage across 100+ languages supports the representativeness requirements that Article 10(3) imposes on high-risk systems operating across linguistic populations.
High-risk AI categories under Annex III—automotive driver monitoring systems, healthcare diagnostic tools, and financial services credit scoring models—face immediate Article 10 obligations. Retrofitting existing pipelines for Article 10 compliance requires months of data engineering work before a single compliance artifact can be produced. Starting with infrastructure designed for auditability is the difference between a compliance package and compliance theater.
Frequently Asked Questions
Does EU AI Act Article 10 apply if we train exclusively on proprietary internal data?
Yes. Article 10 applies to any high-risk AI system as classified under Article 6 and Annex III, regardless of whether training data is proprietary, licensed, or publicly sourced. The obligation rests with the provider of the high-risk system. If your system falls under Annex III categories, your training data governance practices must satisfy Article 10 before market deployment.
What is the practical difference between GDPR and EU AI Act requirements for training data?
They are complementary obligations. GDPR Article 7 governs lawful consent for personal data collection, Article 9 adds heightened requirements for special-category data, and Article 25 requires data protection by design. EU AI Act Article 10 adds a separate layer: technical documentation of data governance, bias examination, and preprocessing reproducibility specific to AI training use cases. Both sets of requirements must be satisfied simultaneously and demonstrable as independent, verifiable artifacts.
What penalties apply if Article 10 requirements are not met?
EU AI Act Article 99 sets fines for non-compliance with data governance obligations at up to €15 million or 3% of global annual turnover, whichever is higher.
What exactly will a notified body ask for during a training data audit?
Auditors will request timestamped preprocessing logs, per-batch annotation provenance records, and bias examination reports with timestamps that predate the training run. They will verify that consent records link to individual data points. A data governance policy document stored separately from your data infrastructure will fail the audit.
Can we outsource our Article 10 obligations to a third-party data provider?
No. A provider can supply the necessary artifacts—consent-linked records, per-annotation provenance logs, and demographic distribution reports—that satisfy the evidentiary requirements Article 10 demands. However, the legal obligation to assemble and maintain the Annex IV technical documentation remains with you, the AI system provider.
Build Your Article 10 Data Governance Foundation
Non-compliance fines under EU AI Act Article 99 run up to €15 million or 3% of global annual turnover. YPAI supplies consent-linked records, per-annotation provenance logs, and demographic distribution reports built to satisfy Article 10 from day one. Reduce the documentation burden before your notified body review.