TEXT DATA / SEVEN CAPABILITY LANES

Seven text-data capabilities. 150+ languages. Linguist-staffed.

From flat NER through preference data and red-team evaluation, the full text-data stack for LLM and NLP teams. Native-speaker linguists, EEA-resident processing, kappa reports per delivery.

Annotation through RLHF
150+ languages
EEA processing

Scope a Data Project See capability matrix

NATIVE SPEAKERS

Bulgarian Czech 한국어 Português Suomi العربية Magyar Bahasa Català Norsk Polski తెలుగు Türkçe Eesti Kiswahili हिन्दी Galego Svenska Tiếng Việt Føroyskt Latviešu Cymraeg Ελληνικά Yorùbá Македонски Euskara ไทย Български Беларуская Lëtzebuergesch isiXhosa 日本語

PROCUREMENT READINESS

Compliance posture for text training and evaluation data.

EU AI Act Article 10 enforcement begins 2 August 2026. YPAI ships every text engagement with the artifacts a regulated buyer needs in their file.

Compliance posture

Annotation guideline

Gold set

Per-schema kappa report

Error analysis

Records of processing (Article 30)

Signed DPA + sub-processor list

EU AI Act Article 10

Data and data governance. Annotation guideline, gold set, kappa report, and error analysis included in every delivery.

GDPR Articles 7, 28, 30

Per-contributor consent records (Article 7). Processor agreement (Article 28). Records of processing (Article 30). 30-day erasure SLA.

EEA-resident processing

Norwegian company structure, EEA contributor network, EEA processing. Outside US CLOUD Act reach.

Request a Procurement Readiness Brief →

We map the evidence package to your data, risk class, and deployment environment.

WHAT WE LABEL

Seven lanes across the text-data lifecycle.

The capabilities a modern LLM or NLP team actually procures, in one provider relationship.

Lane	Sub-tasks	Schema / format	Eval metric
Semantic (NER + linking + relations)	Flat NER, nested NER, entity linking, relation extraction, coreference	CoNLL-U, BIO/IOB, BRAT, spaCy JSON, BioC	Span F1, Cohen's kappa
Pragmatic (sentiment / intent / emotion / stance)	Polarity, ABSA, Plutchik or Ekman or VAD, intent, stance, claim-evidence	JSON-L with per-annotator IDs	Krippendorff alpha, macro F1, micro F1
Classification (topic, multi-label, hierarchical, toxicity)	Flat, multi-label, hierarchical (ICD-10), topic, routing, toxicity, PII	JSON-L, taxonomies in SKOS	Hierarchical F1, calibration ECE, FPR-at-low-FNR
Linguistic (POS / dependency / coref / morphology / WSD)	UD POS and dependency, constituency, coreference, morphology, lemmatization, WSD	CoNLL-U (UD), Penn Treebank	LAS, UAS, CoNLL-F1
Generation (summarization)	Extractive, abstractive, query-focused, long-document, meeting with attribution	JSON-L with source-span links	BERTScore, FActScore, SummaC, human rubric
Preference + safety (RLHF)	Pairwise or n-way preference, critique writing, red-team, factuality, DPO/IPO/KTO	HH-RLHF JSON-L, custom with metadata	Preference agreement, win-rate, MT-Bench-style
Document understanding	OCR, reading order, layout, structured extraction, table-QA, signatures	PAGE-XML, ALTO, hOCR, FUNSD	CER, WER, field F1, ANLS
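
One of the metrics in the matrix above, calibration ECE, is easy to misread, so here is a minimal sketch of how expected calibration error is commonly computed with equal-width confidence bins. The function name, the bin count, and the sample values are illustrative assumptions, not YPAI's reporting code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: booleans, True where the prediction matched the gold label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toxicity-classifier outputs, for illustration only.
sample_confidences = [0.95, 0.95, 0.55, 0.55]
sample_correct = [True, True, True, False]
ece = expected_calibration_error(sample_confidences, sample_correct)
print(f"ECE = {ece:.3f}")
```

A lower value means the classifier's stated confidence tracks its observed accuracy more closely on the labeled evaluation set.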

ANNOTATION TAXONOMY

Where text annotation fits in the LLM stack.

Pre-training corpora, SFT alignment data, RLHF preference pairs, eval and red-team. Every layer has a labeling problem.

TEXT DATA

Pre-training

filtering, quality classifiers, PII strip

SFT and instruction

prompt-response pairs, tool-call traces, agentic trajectories, chain-of-thought

Preference

pairwise, n-way, critique, DPO / IPO / KTO

Safety

red-team, refusal, factuality, dual-use, CBRN and cyber

Domain annotation

NER, classification, schema labeling

Evaluation

summarization faithfulness, hallucination labels, per-language regression, cross-lingual transfer

HOW WE LABEL

Every project clears the same seven gates.

Calibration before production. Documented kappa thresholds. 100% human QA. The artifacts you need for your Article 10 file.

Schema design

Co-designed with your team. Versioned. Tied to the eval metric you will run against.

Deliverable: Versioned schema spec

Annotation guidelines

Edge cases enumerated, examples per class, glossary of domain terms. Updated as adjudication surfaces new cases.

Deliverable: Annotation guideline document

Calibration round

Pilot batch on a shared subset. Disagreements adjudicated before scale. Guidelines refined on findings.

Deliverable: Calibration kappa report

IAA gate

Production starts only when calibration clears Landis-Koch substantial agreement (kappa 0.6+) on the schema.

Deliverable: Gate-pass attestation
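
For readers who want to see what the gate checks, the following is a minimal sketch of Cohen's kappa for two annotators and a pass/fail test against the 0.6 threshold stated above. The function names and the sample labels are illustrative assumptions; a real calibration round may use Fleiss' kappa or Krippendorff's alpha when more than two annotators are involved.

```python
from collections import Counter

KAPPA_GATE = 0.6  # threshold taken from the gate description above


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def passes_iaa_gate(labels_a, labels_b, threshold=KAPPA_GATE):
    """True when agreement on the calibration batch meets the threshold."""
    return cohens_kappa(labels_a, labels_b) >= threshold


# Hypothetical calibration batch of ten token labels.
annotator_1 = ["PER", "ORG", "ORG", "LOC", "PER", "O", "O", "ORG", "LOC", "O"]
annotator_2 = ["PER", "ORG", "LOC", "LOC", "PER", "O", "O", "ORG", "LOC", "PER"]
kappa = cohens_kappa(annotator_1, annotator_2)
gate_passed = passes_iaa_gate(annotator_1, annotator_2)
print(f"kappa={kappa:.3f} gate_passed={gate_passed}")
```

Kappa corrects raw agreement for the agreement two annotators would reach by chance, which is why it is a stricter gate than percent agreement on skewed label sets.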

Production labeling

Trained linguists, with native speakers on multilingual projects. Annotator selection criteria documented per project.

Deliverable: Labeled batches

100% human QA + adjudication

Every batch reviewed. Gold-set items injected at a documented rate. Disagreements above threshold escalate to a lead annotator with adjudication trail recorded.

Deliverable: QA review log + adjudication record
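
As an illustration of how injected gold-set items can feed the escalation step, the sketch below scores each annotator only on the gold items they happened to label and flags anyone below a threshold. The record layout, the identifiers, and the 0.8 threshold are hypothetical; actual rates and thresholds are set per project.

```python
from collections import defaultdict


def gold_accuracy_by_annotator(records, gold):
    """Accuracy per annotator on injected gold items.

    records: iterable of (annotator_id, item_id, label) tuples.
    gold: dict mapping gold item_id to its adjudicated label.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for annotator_id, item_id, label in records:
        if item_id not in gold:
            continue  # ordinary production item, not scored here
        seen[annotator_id] += 1
        hits[annotator_id] += int(label == gold[item_id])
    return {annotator: hits[annotator] / seen[annotator] for annotator in seen}


def flag_for_escalation(accuracy, min_accuracy):
    """Annotator IDs whose gold accuracy falls below min_accuracy."""
    return sorted(a for a, acc in accuracy.items() if acc < min_accuracy)


# Hypothetical batch: g1 and g2 are gold items, x7 is a production item.
gold_items = {"g1": "toxic", "g2": "safe"}
batch = [
    ("ann_a", "g1", "toxic"),
    ("ann_a", "x7", "safe"),
    ("ann_a", "g2", "safe"),
    ("ann_b", "g1", "safe"),
    ("ann_b", "g2", "safe"),
]
accuracy = gold_accuracy_by_annotator(batch, gold_items)
flagged = flag_for_escalation(accuracy, min_accuracy=0.8)  # assumed threshold
print(accuracy, flagged)
```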

Delivery with kappa report

Per-schema metrics, error analysis, audit trail, and Article 30 records ship with the dataset.

Deliverable: Final delivery pack

Seven gates. One trail of evidence. Every delivery.

Start the scoping process →

DOMAIN DEPTH

Three regulated verticals where generic crowdsource fails.

Medical ontologies, legal benchmarks, financial regulation. Domain knowledge plus YPAI infrastructure. Annotation only; not medical, legal, or investment advice.

Medical NLP

Clinical-note de-identification, ICD-10 / ICD-11 coding, SNOMED-CT and RxNorm linking, MedDRA pharmacovigilance, FHIR R4 export.

ICD-10, SNOMED, FHIR R4

Legal NLP

Contract clause annotation (CUAD), LegalBench-style task suites, jurisdiction tagging, eDiscovery TAR, DORA and NIS2 rule extraction.

CUAD, LegalBench, DORA

Financial NLP

FOMC and ECB sentiment, earnings-call transcripts, ESG taxonomy and greenwashing, KYC and AML, MiFID II suitability, market abuse.

ESG, KYC, MiFID II

150+ LANGUAGES / NATIVE SPEAKERS

Where the LLM data gap actually lives.

LLM training data is roughly 45% English. EU AI Act Article 55 expects per-language disclosure across the 24 EU official languages. Most providers crowdsource translations; we staff native speakers.

EU official: 24 languages

Bulgarian Croatian Czech Danish Dutch English Estonian Finnish French German Greek Hungarian Irish Italian Latvian Lithuanian Maltese Polish Portuguese Romanian Slovak Slovenian Spanish Swedish

EU minority and Nordic: 9 languages

Sami Faroese Welsh Basque Catalan Galician Frisian Luxembourgish Icelandic

Indic and Southeast Asian: 15 languages

Bengali Hindi Marathi Tamil Telugu Gujarati Kannada Malayalam Punjabi Thai Vietnamese Tagalog Indonesian Burmese Khmer

African and low-resource: 10 languages

Swahili Yoruba Amharic Hausa Wolof Zulu Xhosa Igbo Tigrinya Oromo

58 languages shown / 150+ supported. FLORES-200 baseline; 200 languages total. Per-project native-speaker availability confirmed on brief.

YPAI BY THE NUMBERS

100%

HUMAN QA

0.6+

KAPPA THRESHOLD

30 days

ERASURE SLA

40,000+

VETTED CONTRIBUTORS

INDUSTRIES WE SERVE

Built for Buyers With Operational Risk

Healthcare, mobility, finance, government, and industrial teams need traceable data, defensible QA, and delivery records that survive review.

Defense & Government

Controlled data workflows for public-sector and defence-adjacent AI deployments.

Sovereignty, Public sector, Controls

Healthcare & Life Sciences

Clinical NLP, medical imaging, and consent-aware healthcare datasets.

Consent, Clinical data, GDPR

Financial Services

Document AI, model governance support, and risk-focused data workflows.

Risk review, Documents, Governance

Automotive & Mobility

ADAS, in-cabin AI, mobility datasets, and safety-sensitive annotation.

ADAS, Voice, Perception

Manufacturing & Industrial

Quality vision, industrial data capture, and reporting automation support.

Quality vision, Sensor data, Operations

Enterprise & AI

LLM training data, RAG evaluation, RLHF, and benchmark datasets.

RLHF, Evaluation, Benchmarks

Explore industry solutions →

Or describe the operating environment and YPAI will scope the data path.

WHAT YOU RECEIVE

Every delivery ships with the artifact pack your Article 10 file needs.

The records a regulated buyer expects, included with every text engagement. No upgrade tier, no separate request.

Annotation guideline and edge cases.

Versioned guideline with edge cases enumerated, examples per class, and glossary of domain terms. Updated as adjudication surfaces new patterns; every version preserved for audit.

Gold set with injection rate.

Held-out gold set, injected during production at a rate undisclosed to annotators, with per-annotator accuracy gates. Calibration before queue entry, blind re-runs on failure.

Per-schema kappa or alpha report.

Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha computed per schema and per language pair on every delivery. Span tasks ship with per-class F1 as primary metric.
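
To make the span metric concrete, here is a minimal sketch of exact-match per-class F1 over labeled spans. It assumes spans are (start, end, label) tuples and that a prediction counts only when both boundaries and the label match; partial-overlap scoring is a different convention and is not shown. All names and sample spans are illustrative.

```python
def per_class_span_f1(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 per class.

    Each span is a (start, end, label) tuple. A predicted span is a true
    positive only if the identical tuple appears in the gold set.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    labels = sorted({span[2] for span in gold | pred})
    scores = {}
    for label in labels:
        g = {s for s in gold if s[2] == label}
        p = {s for s in pred if s[2] == label}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical gold and annotator spans over one document.
gold = [(0, 2, "PER"), (5, 7, "ORG"), (10, 11, "LOC")]
pred = [(0, 2, "PER"), (5, 8, "ORG"), (10, 11, "LOC"), (12, 13, "LOC")]
scores = per_class_span_f1(gold, pred)
for label, metrics in scores.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

In this sample the ORG span misses by one token on its right boundary, so exact-match scoring gives it no credit, which is the kind of boundary disagreement a per-class report surfaces.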

Error analysis with top failure modes.

Per-class confusion matrix, top failure modes by frequency, and recommended schema or guideline updates. The signal you need to decide whether to retrain or re-label.
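
A short sketch of the idea behind "top failure modes by frequency": count each (gold label, assigned label) disagreement and rank the pairs. The function name and the sample labels are assumptions for illustration, not the format of the delivered report.

```python
from collections import Counter


def top_confusions(gold_labels, assigned_labels, k=3):
    """Most frequent (gold, assigned) disagreement pairs, highest count first."""
    if len(gold_labels) != len(assigned_labels):
        raise ValueError("label lists must have the same length")
    errors = Counter(
        (gold, assigned)
        for gold, assigned in zip(gold_labels, assigned_labels)
        if gold != assigned
    )
    return errors.most_common(k)


# Hypothetical sentiment labels: adjudicated gold vs. production labels.
gold_labels = ["pos", "neg", "neu", "neg", "neu", "pos", "neu"]
assigned_labels = ["pos", "neu", "neu", "neu", "neg", "pos", "pos"]
failure_modes = top_confusions(gold_labels, assigned_labels, k=2)
for (gold, assigned), count in failure_modes:
    print(f"gold={gold} labeled_as={assigned} count={count}")
```

A pair that dominates the list, such as negative items labeled neutral in this sample, usually points to a guideline gap rather than to individual annotator error.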

Records of processing, DPA, and sub-processor list.

Article 30 records of processing, signed Article 28 DPA, lawful-basis documentation, 30-day erasure SLA, and full sub-processor list with Article 28(2) change notifications. All included with every engagement.

START A PROJECT

Brief us. We reply within one business day.

Short brief now, deeper scoping in the reply.

QUESTIONS BUYERS ACTUALLY ASK

Frequently asked questions

What text capabilities do you cover?

Annotation, SFT data, RLHF preference pairs, red-team, summarization, document understanding, and evaluation. Seven lanes covered by trained linguists; see the capability matrix for sub-tasks, schema, and eval metrics.

How many languages, and what level of native-speaker depth?

The contributor pool covers the 24 EU official languages, EU minority and Nordic languages, Indic and Southeast Asian languages, and a growing African low-resource lane. Per-project native-speaker availability confirmed on brief. Total of 150+ supported.

What quality evidence do I receive with each delivery?

Annotation guideline (versioned), gold set, per-schema Cohen's kappa or Krippendorff alpha report, error analysis, Article 30 records of processing, signed DPA, and sub-processor list. Per-class F1 for span tasks. No accuracy SLAs without your gold set.

Where is data processed?

EEA-resident. Norwegian company, EEA contributor network, EEA infrastructure. 30-day GDPR erasure SLA. Outside US CLOUD Act reach.

How do you handle annotator wellbeing on red-team and toxicity work?

Screened opt-in. Exposure limits per shift. Mandatory rotation off sensitive content. Counselling access. The specifics are documented in the project brief; we will not pretend the duty-of-care obligation here is small.

GDPR-Native, EU AI Act Article 10, EEA Operations, Consent Evidence

Brief us on your text-data project.

One business day reply. NDA on request. DPA included.

Annotation through RLHF 150+ Languages

Or connect on LinkedIn →

Your information is never shared. We respond with the next scoping step.
