TEXT DATA / SEVEN CAPABILITY LANES
Seven text-data capabilities. 150+ languages. Linguist-staffed.
From flat NER through preference data and red-team evaluation, the full text-data stack for LLM and NLP teams. Native-speaker linguists, EEA-resident processing, kappa reports per delivery.
- Annotation through RLHF
- 150+ languages
- EEA processing
PROCUREMENT READINESS
Compliance posture for text training and evaluation data.
Article 10 enforcement begins 2 August 2026. YPAI ships every text engagement with the artifacts a regulated buyer needs in their file.
EU AI Act Article 10
Data and data governance. Annotation guideline, gold set, kappa report, and error analysis included in every delivery.
GDPR Articles 7, 28, 30
Per-contributor consent records (Article 7). Processor agreement (Article 28). Records of processing (Article 30). 30-day erasure SLA.
EEA-resident processing
Norwegian company structure, EEA contributor network, EEA processing. Outside US CLOUD Act reach.
We map the evidence package to your data, risk class, and deployment environment.
WHAT WE LABEL
Seven lanes across the text-data lifecycle.
The capabilities a modern LLM or NLP team actually procures, in one provider relationship.
| Lane | Sub-tasks | Schema / format | Eval metric |
|---|---|---|---|
| Semantic (NER + linking + relations) | Flat NER, nested NER, entity linking, relation extraction, coreference | CoNLL-U, BIO/IOB, BRAT, spaCy JSON, BioC | Span F1, Cohen's kappa |
| Pragmatic (sentiment / intent / emotion / stance) | Polarity, ABSA, Plutchik or Ekman or VAD, intent, stance, claim-evidence | JSON-L with per-annotator IDs | Krippendorff's alpha, macro F1, micro F1 |
| Classification (topic, multi-label, hierarchical, toxicity) | Flat, multi-label, hierarchical (ICD-10), topic, routing, toxicity, PII | JSON-L, taxonomies in SKOS | Hierarchical F1, calibration ECE, FPR-at-low-FNR |
| Linguistic (POS / dependency / coref / morphology / WSD) | UD POS and dependency, constituency, coreference, morphology, lemmatization, WSD | CoNLL-U (UD), Penn Treebank | LAS, UAS, CoNLL-F1 |
| Generation (summarization) | Extractive, abstractive, query-focused, long-document, meeting with attribution | JSON-L with source-span links | BERTScore, FActScore, SummaC, human rubric |
| Preference + safety (RLHF) | Pairwise or n-way preference, critique writing, red-team, factuality, DPO/IPO/KTO | HH-RLHF JSON-L, custom with metadata | Preference agreement, win-rate, MT-Bench-style |
| Document understanding | OCR, reading order, layout, structured extraction, table-QA, signatures | PAGE-XML, ALTO, hOCR, FUNSD | CER, WER, field F1, ANLS |
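For readers less familiar with the span metric in the first lane, here is a minimal sketch of exact-match span F1 computed from BIO tags. The function names `bio_to_spans` and `span_f1` are illustrative assumptions, not part of any YPAI tooling, and the strict handling of stray `I-` tags is one possible convention among several.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of labeled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an I- tag with no matching open span is ignored.
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1: a span counts only if boundaries and label agree."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "O"]
print(round(span_f1(gold, pred), 3))
```

In this example the prediction finds one of two gold spans with no false positives, so precision is 1.0 and recall is 0.5.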
ANNOTATION TAXONOMY
Where text annotation fits in the LLM stack.
Pre-training corpora, SFT alignment data, RLHF preference pairs, eval and red-team. Every layer has a labeling problem.
HOW WE LABEL
Every project clears the same seven gates.
Calibration before production. Documented kappa thresholds. 100% human QA. The artifacts you need for your Article 10 file.
Schema design
Co-designed with your team. Versioned. Tied to the eval metric you will run against.
Deliverable: Versioned schema spec
Annotation guidelines
Edge cases enumerated, examples per class, glossary of domain terms. Updated as adjudication surfaces new cases.
Deliverable: Annotation guideline document
Calibration round
Pilot batch on a shared subset. Disagreements adjudicated before scale. Guidelines refined on findings.
Deliverable: Calibration kappa report
IAA gate
Production starts only when calibration clears Landis-Koch substantial agreement (kappa 0.6+) on the schema.
Deliverable: Gate-pass attestation
Production labeling
Trained linguists, with native speakers on multilingual work. Annotator selection criteria documented per project.
Deliverable: Labeled batches
100% human QA + adjudication
Every batch reviewed. Gold-set items injected at a documented rate. Disagreements above threshold escalate to a lead annotator with adjudication trail recorded.
Deliverable: QA review log + adjudication record
Delivery with kappa report
Per-schema metrics, error analysis, audit trail, and Article 30 records ship with the dataset.
Deliverable: Final delivery pack
Seven gates. One trail of evidence. Every delivery.
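As an illustration of the IAA gate, the sketch below computes Cohen's kappa for two annotators and checks it against the 0.6 threshold stated above. It assumes a simple two-annotator, single-label setup; `cohens_kappa`, `passes_gate`, and `KAPPA_GATE` are hypothetical names, and a real project may use Fleiss' kappa or Krippendorff's alpha instead.

```python
from collections import Counter
from typing import Hashable, Sequence

KAPPA_GATE = 0.6  # threshold taken from the IAA gate description


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently
    # at their own observed label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def passes_gate(a: Sequence[Hashable], b: Sequence[Hashable],
                threshold: float = KAPPA_GATE) -> bool:
    """True when calibration agreement clears the gate threshold."""
    return cohens_kappa(a, b) >= threshold


ann_1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG", "POS", "NEG"]
ann_2 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG", "POS", "POS"]
print(round(cohens_kappa(ann_1, ann_2), 2), passes_gate(ann_1, ann_2))
```

Here the annotators agree on 9 of 10 items, but half of that agreement is expected by chance, which is why kappa lands below raw agreement.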
DOMAIN DEPTH
Three regulated verticals where generic crowdsource fails.
Medical ontologies, legal benchmarks, financial regulation. Domain knowledge plus YPAI infrastructure. Annotation only; not medical, legal, or investment advice.
Medical NLP
Clinical-note de-identification, ICD-10 / ICD-11 coding, SNOMED-CT and RxNorm linking, MedDRA pharmacovigilance, FHIR R4 export.
Legal NLP
Contract clause annotation (CUAD), LegalBench-style task suites, jurisdiction tagging, eDiscovery TAR, DORA and NIS2 rule extraction.
Financial NLP
FOMC and ECB sentiment, earnings-call transcripts, ESG taxonomy and greenwashing, KYC and AML, MiFID II suitability, market abuse.
150+ LANGUAGES / NATIVE SPEAKERS
Where the LLM data gap actually lives.
LLM training data is roughly 45% English. EU AI Act Article 55 expects per-language disclosure across the 24 EU official languages. Most providers crowdsource translations; we staff native speakers.
58 languages shown / 150+ supported. FLORES-200 baseline; 200 languages total. Per-project native-speaker availability confirmed on brief.
YPAI BY THE NUMBERS
HUMAN QA
KAPPA THRESHOLD
ERASURE SLA
VETTED CONTRIBUTORS
INDUSTRIES WE SERVE
Built for Buyers With Operational Risk
Healthcare, mobility, finance, government, and industrial teams need traceable data, defensible QA, and delivery records that survive review.
Defense & Government
Controlled data workflows for public-sector and defence-adjacent AI deployments.
Healthcare & Life Sciences
Clinical NLP, medical imaging, and consent-aware healthcare datasets.
Financial Services
Document AI, model governance support, and risk-focused data workflows.
Automotive & Mobility
ADAS, in-cabin AI, mobility datasets, and safety-sensitive annotation.
Manufacturing & Industrial
Quality vision, industrial data capture, and reporting automation support.
Enterprise & AI
LLM training data, RAG evaluation, RLHF, and benchmark datasets.
Or describe the operating environment and YPAI will scope the data path.
WHAT YOU RECEIVE
Every delivery ships with the artifact pack your Article 10 file needs.
The records a regulated buyer expects, included with every text engagement. No upgrade tier, no separate request.
Annotation guideline and edge cases.
Versioned guideline with edge cases enumerated, examples per class, and glossary of domain terms. Updated as adjudication surfaces new patterns; every version preserved for audit.
Gold set with injection rate.
Held-out gold set; injection rate undisclosed to annotators during production; per-annotator accuracy gates. Calibration before queue entry, blind re-runs on failure.
Per-schema kappa or alpha report.
Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha computed per schema and per language pair on every delivery. Span tasks ship with per-class F1 as primary metric.
Error analysis with top failure modes.
Per-class confusion matrix, top failure modes by frequency, and recommended schema or guideline updates. The signal you need to decide whether to retrain or re-label.
Records of processing, DPA, and sub-processor list.
Article 30 records of processing, signed Article 28 DPA, lawful-basis documentation, 30-day erasure SLA, and full sub-processor list with Article 28(2) change notifications. All included with every engagement.
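To make the error-analysis artifact concrete, here is a small sketch of how a per-class confusion count and a ranked list of failure modes could be derived from gold and produced labels. The helper names `confusion_counts` and `top_failure_modes` are assumptions for illustration; the actual report format is defined per project.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (gold label, produced label)


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count every (gold, produced) label pair, including correct ones."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def top_failure_modes(gold: List[str], pred: List[str],
                      k: int = 3) -> List[Tuple[Pair, int]]:
    """Return the k most frequent disagreements, most common first."""
    errors = Counter({pair: n for pair, n in confusion_counts(gold, pred).items()
                      if pair[0] != pair[1]})
    return errors.most_common(k)


gold = ["A", "A", "A", "B", "B", "C"]
pred = ["B", "B", "A", "B", "C", "C"]
for (g, p), n in top_failure_modes(gold, pred):
    print(f"gold={g} labeled as {p}: {n}")
```

A ranked list like this points at which class boundary the guideline should address first.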
START A PROJECT
Brief us. We reply within one business day.
Short brief now, deeper scoping in the reply.
Frequently asked questions
Annotation, SFT data, RLHF preference pairs, red-team, summarization, document understanding, and evaluation. Seven lanes covered by trained linguists; see the capability matrix for sub-tasks, schema, and eval metrics.
The contributor pool covers the 24 EU official languages, EU minority and Nordic languages, Indic and Southeast Asian languages, and a growing African low-resource lane. Per-project native-speaker availability confirmed on brief. Total of 150+ supported.
Annotation guideline (versioned), gold set, per-schema Cohen's kappa or Krippendorff alpha report, error analysis, Article 30 records of processing, signed DPA, and sub-processor list. Per-class F1 for span tasks. No accuracy SLAs without your gold set.
EEA-resident. Norwegian company, EEA contributor network, EEA infrastructure. 30-day GDPR erasure SLA. Outside US CLOUD Act reach.
Screened opt-in. Exposure limits per shift. Mandatory rotation off sensitive content. Counselling access. The specifics are documented in the project brief; we do not understate the duty-of-care obligation this work carries.
Brief us on your text-data project.
One business day reply. NDA on request. DPA included.
Your information is never shared. We respond with the next scoping step.