TEXT DATA / SEVEN CAPABILITY LANES

Seven text-data capabilities. 150+ languages. Linguist-staffed.

From flat NER through preference data and red-team evaluation, the full text-data stack for LLM and NLP teams. Native-speaker linguists, EEA-resident processing, kappa reports per delivery.

  • Annotation through RLHF
  • 150+ languages
  • EEA processing
NATIVE SPEAKERS
Bulgarian Czech 한국어 Português Suomi العربية Magyar Bahasa Català Norsk Polski తెలుగు Türkçe Eesti Kiswahili हिन्दी Galego Svenska Tiếng Việt Føroyskt Latviešu Cymraeg Ελληνικά Yorùbá Македонски Euskara ไทย Български Беларуская Lëtzebuergesch isiXhosa 日本語

PROCUREMENT READINESS

Compliance posture for text training and evaluation data.

EU AI Act Article 10 enforcement begins 2 August 2026. YPAI ships every text engagement with the artifacts a regulated buyer needs in their file.

Compliance posture

Annotation guideline
Gold set
Per-schema kappa report
Error analysis
Records of processing (Article 30)
Signed DPA + sub-processor list

EU AI Act Article 10

Data and data governance. Annotation guideline, gold set, kappa report, and error analysis included in every delivery.

GDPR Articles 7, 28, 30

Per-contributor consent records (Article 7). Processor agreement (Article 28). Records of processing (Article 30). 30-day erasure SLA.

EEA-resident processing

Norwegian company structure, EEA contributor network, EEA processing. Outside US CLOUD Act reach.

Request a Procurement Readiness Brief →

We map the evidence package to your data, risk class, and deployment environment.

WHAT WE LABEL

Seven lanes across the text-data lifecycle.

The capabilities a modern LLM or NLP team actually procures, in one provider relationship.

Lane: Semantic (NER + linking + relations)
  Sub-tasks: Flat NER, nested NER, entity linking, relation extraction, coreference
  Schema / format: CoNLL-U, BIO/IOB, BRAT, spaCy JSON, BioC
  Eval metric: Span F1, Cohen's kappa

Lane: Pragmatic (sentiment / intent / emotion / stance)
  Sub-tasks: Polarity, ABSA, Plutchik or Ekman or VAD, intent, stance, claim-evidence
  Schema / format: JSON-L with per-annotator IDs
  Eval metric: Krippendorff alpha, macro F1, micro F1

Lane: Classification (topic, multi-label, hierarchical, toxicity)
  Sub-tasks: Flat, multi-label, hierarchical (ICD-10), topic, routing, toxicity, PII
  Schema / format: JSON-L, taxonomies in SKOS
  Eval metric: Hierarchical F1, calibration ECE, FPR-at-low-FNR

Lane: Linguistic (POS / dependency / coref / morphology / WSD)
  Sub-tasks: UD POS and dependency, constituency, coreference, morphology, lemmatization, WSD
  Schema / format: CoNLL-U (UD), Penn Treebank
  Eval metric: LAS, UAS, CoNLL-F1

Lane: Generation (summarization)
  Sub-tasks: Extractive, abstractive, query-focused, long-document, meeting with attribution
  Schema / format: JSON-L with source-span links
  Eval metric: BERTScore, FActScore, SummaC, human rubric

Lane: Preference + safety (RLHF)
  Sub-tasks: Pairwise or n-way preference, critique writing, red-team, factuality, DPO/IPO/KTO
  Schema / format: HH-RLHF JSON-L, custom with metadata
  Eval metric: Preference agreement, win-rate, MT-Bench-style

Lane: Document understanding
  Sub-tasks: OCR, reading order, layout, structured extraction, table-QA, signatures
  Schema / format: PAGE-XML, ALTO, hOCR, FUNSD
  Eval metric: CER, WER, field F1, ANLS
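
The BIO/IOB format listed for the semantic lane stores entity spans as one tag per token. As a rough illustration of how such tags map back to spans, here is a minimal Python sketch. The function name `bio_to_spans`, the lenient handling of stray I- tags, and the sample tags are assumptions made for this example, not a description of YPAI tooling.

```python
def bio_to_spans(tags):
    """Decode BIO tags into (start_token, end_token_exclusive, label) spans."""
    spans, start, label = [], None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: a stray I- tag opens a new span instead of raising.
            start, label = i, tag[2:]
    return spans


tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
spans = bio_to_spans(tags)
print(spans)
```

Spans in this form can be compared directly against a gold set, which is what span-level F1 is computed over.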

ANNOTATION TAXONOMY

Where text annotation fits in the LLM stack.

Pre-training corpora, SFT alignment data, RLHF preference pairs, eval and red-team. Every layer has a labeling problem.

TEXT DATA
Pre-training: filtering, quality classifiers, PII strip
SFT and instruction: prompt-response pairs, tool-call traces, agentic trajectories, chain-of-thought
Preference: pairwise, n-way, critique, DPO / IPO / KTO
Safety: red-team, refusal, factuality, dual-use CBRN and cyber
Domain annotation: NER, classification, schema labeling
Evaluation: summarization faithfulness, hallucination labels, per-language regression, cross-lingual transfer

HOW WE LABEL

Every project clears the same seven gates.

Calibration before production. Documented kappa thresholds. 100% human QA. The artifacts you need for your Article 10 file.

01

Schema design

Co-designed with your team. Versioned. Tied to the eval metric you will run against.

Deliverable: Versioned schema spec

02

Annotation guidelines

Edge cases enumerated, examples per class, glossary of domain terms. Updated as adjudication surfaces new cases.

Deliverable: Annotation guideline document

03

Calibration round

Pilot batch on a shared subset. Disagreements adjudicated before scale. Guidelines refined on findings.

Deliverable: Calibration kappa report

04

IAA gate

Production starts only when calibration clears Landis-Koch substantial agreement (kappa 0.6+) on the schema.

Deliverable: Gate-pass attestation
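
For readers who want to see what a gate of this kind checks, here is a minimal Python sketch of Cohen's kappa for two annotators, compared against the 0.6 threshold. The function names `cohen_kappa` and `passes_iaa_gate`, the two-annotator setup, and the sample labels are assumptions for illustration only. Projects with more than two annotators would typically use Fleiss' kappa or Krippendorff's alpha instead.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def passes_iaa_gate(labels_a, labels_b, threshold=0.6):
    """True when agreement reaches the (assumed) gate threshold."""
    return cohen_kappa(labels_a, labels_b) >= threshold


annotator_a = ["PER", "ORG", "PER", "LOC", "ORG", "PER", "LOC", "ORG"]
annotator_b = ["PER", "ORG", "PER", "LOC", "PER", "PER", "LOC", "ORG"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3), passes_iaa_gate(annotator_a, annotator_b))
```

Raw agreement in this sample is 7 of 8 items, but kappa comes out lower because part of that agreement would be expected by chance. That correction is why the gate is stated in kappa and not in percent agreement.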

05

Production labeling

Trained linguists, native speakers for multilingual. Annotator selection criteria documented per project.

Deliverable: Labeled batches

06

100% human QA + adjudication

Every batch reviewed. Gold-set items injected at a documented rate. Disagreements above threshold escalate to a lead annotator with adjudication trail recorded.

Deliverable: QA review log + adjudication record

07

Delivery with kappa report

Per-schema metrics, error analysis, audit trail, and Article 30 records ship with the dataset.

Deliverable: Final delivery pack

Seven gates. One trail of evidence. Every delivery.

150+ LANGUAGES / NATIVE SPEAKERS

Where the LLM data gap actually lives.

LLM training data is roughly 45% English. EU AI Act Article 55 expects per-language disclosure across the 24 EU official languages. Most providers crowdsource translations; we staff native speakers.

EU official (24 languages)
Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish
EU minority and Nordic (9 languages)
Sami, Faroese, Welsh, Basque, Catalan, Galician, Frisian, Luxembourgish, Icelandic
Indic and Southeast Asian (15 languages)
Bengali, Hindi, Marathi, Tamil, Telugu, Gujarati, Kannada, Malayalam, Punjabi, Thai, Vietnamese, Tagalog, Indonesian, Burmese, Khmer
African and low-resource (10 languages)
Swahili, Yoruba, Amharic, Hausa, Wolof, Zulu, Xhosa, Igbo, Tigrinya, Oromo

58 languages shown; 150+ supported. Baseline: FLORES-200 (200 languages in total). Per-project native-speaker availability is confirmed on brief.

YPAI BY THE NUMBERS

100% HUMAN QA
0.6+ KAPPA THRESHOLD
30 days ERASURE SLA
40,000+ VETTED CONTRIBUTORS

WHAT YOU RECEIVE

Every delivery ships with the artifact pack your Article 10 file needs.

The records a regulated buyer expects, included with every text engagement. No upgrade tier, no separate request.

Annotation guideline and edge cases.

Versioned guideline with edge cases enumerated, examples per class, and glossary of domain terms. Updated as adjudication surfaces new patterns; every version preserved for audit.

Gold set with injection rate.

Held-out gold set, injected during production at a rate that is not disclosed to annotators, with per-annotator accuracy gates. Calibration before queue entry, blind re-runs on failure.
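
A per-annotator accuracy gate on gold items can be pictured with the short Python sketch below. Everything specific in it is hypothetical: the function names, the `(annotator_id, item_id, label)` record shape, the 0.9 minimum accuracy, and the sample data are chosen for the example and are not taken from a YPAI project.

```python
from collections import defaultdict


def gold_accuracy_by_annotator(submissions, gold):
    """Accuracy on gold items only.

    submissions: iterable of (annotator_id, item_id, label)
    gold: dict mapping gold item_id to its reference label
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for annotator_id, item_id, label in submissions:
        if item_id not in gold:
            continue  # ordinary production item, not part of the gold set
        seen[annotator_id] += 1
        hits[annotator_id] += int(label == gold[item_id])
    return {a: hits[a] / seen[a] for a in seen}


def annotators_below_gate(accuracy, min_accuracy=0.9):
    """Annotators whose gold accuracy falls under the (assumed) gate."""
    return sorted(a for a, acc in accuracy.items() if acc < min_accuracy)


gold = {"g1": "toxic", "g2": "safe", "g3": "safe", "g4": "toxic"}
submissions = [
    ("ann_01", "g1", "toxic"), ("ann_01", "g2", "safe"),
    ("ann_01", "p7", "safe"), ("ann_01", "g3", "safe"),
    ("ann_01", "g4", "toxic"),
    ("ann_02", "g1", "safe"), ("ann_02", "g2", "safe"),
    ("ann_02", "g3", "toxic"), ("ann_02", "g4", "toxic"),
]
accuracy = gold_accuracy_by_annotator(submissions, gold)
flagged = annotators_below_gate(accuracy)
print(accuracy, flagged)
```

Because annotators cannot tell gold items from production items, the accuracy measured this way reflects ordinary working behavior.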

Per-schema kappa or alpha report.

Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha computed per schema and per language pair on every delivery. Span tasks ship with per-class F1 as primary metric.
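
To make the span metric concrete, here is a minimal Python sketch of per-class span F1 under exact-match scoring. The exact-match rule, the `(doc_id, start, end, label)` tuple shape, and the function name `per_class_span_f1` are assumptions for illustration. A real project may score partial overlaps differently.

```python
from collections import defaultdict


def per_class_span_f1(gold_spans, pred_spans):
    """Exact-match span F1 per class.

    Spans are (doc_id, start, end, label) tuples. A predicted span counts
    as correct only if all four fields match a gold span.
    """
    gold_set, pred_set = set(gold_spans), set(pred_spans)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred_set:
        counts[span[3]]["tp" if span in gold_set else "fp"] += 1
    for span in gold_set - pred_set:
        counts[span[3]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        found, expected = c["tp"] + c["fp"], c["tp"] + c["fn"]
        precision = c["tp"] / found if found else 0.0
        recall = c["tp"] / expected if expected else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


gold_spans = [("d1", 0, 5, "PER"), ("d1", 10, 16, "ORG"),
              ("d2", 3, 9, "PER"), ("d2", 20, 26, "LOC")]
pred_spans = [("d1", 0, 5, "PER"), ("d1", 10, 16, "ORG"),
              ("d2", 3, 8, "PER"), ("d2", 20, 26, "LOC")]
scores = per_class_span_f1(gold_spans, pred_spans)
print(scores)
```

In the sample, one PER span is off by a single character at its end boundary, so it counts as both a false positive and a false negative for that class.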

Error analysis with top failure modes.

Per-class confusion matrix, top failure modes by frequency, and recommended schema or guideline updates. The signal you need to decide whether to retrain or re-label.
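
A list of top failure modes can be read off a confusion count, as in the Python sketch below. The function name `top_failure_modes`, the intent labels, and the comparison of annotator labels against adjudicated labels are assumptions made for this example.

```python
from collections import Counter


def top_failure_modes(adjudicated, annotated, k=3):
    """Most frequent (adjudicated_label, annotated_label) confusions.

    Pairs where the two labels agree are excluded, so only errors remain.
    """
    confusion = Counter(zip(adjudicated, annotated))
    errors = Counter({pair: n for pair, n in confusion.items()
                      if pair[0] != pair[1]})
    return errors.most_common(k)


adjudicated = ["refund", "refund", "cancel", "cancel",
               "refund", "upgrade", "cancel"]
annotated = ["cancel", "cancel", "cancel", "refund",
             "refund", "upgrade", "cancel"]
failures = top_failure_modes(adjudicated, annotated)
print(failures)
```

A confusion that runs mostly in one direction, as refund labeled as cancel does here, usually points to a guideline gap between two classes more than to random annotator noise.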

Records of processing, DPA, and sub-processor list.

Article 30 records of processing, signed Article 28 DPA, lawful-basis documentation, 30-day erasure SLA, and full sub-processor list with Article 28(2) change notifications. All included with every engagement.

START A PROJECT

Brief us. We reply within one business day.

Short brief now, deeper scoping in the reply.

Capability lanes (NER, RLHF, etc.), languages, volume, regulatory context.

QUESTIONS BUYERS ACTUALLY ASK

Frequently asked questions

Annotation, SFT data, RLHF preference pairs, red-team, summarization, document understanding, and evaluation. Seven lanes covered by trained linguists; see the capability matrix for sub-tasks, schema, and eval metrics.

The contributor pool covers the 24 EU official languages, EU minority and Nordic languages, Indic and Southeast Asian languages, and a growing African low-resource lane. Per-project native-speaker availability is confirmed on brief. 150+ languages supported in total.

Annotation guideline (versioned), gold set, per-schema Cohen's kappa or Krippendorff alpha report, error analysis, Article 30 records of processing, signed DPA, and sub-processor list. Per-class F1 for span tasks. No accuracy SLAs without your gold set.

EEA-resident. Norwegian company, EEA contributor network, EEA infrastructure. 30-day GDPR erasure SLA. Outside US CLOUD Act reach.

Screened opt-in. Exposure limits per shift. Mandatory rotation off sensitive content. Counselling access. The specifics are documented in the project brief; we do not understate the duty-of-care obligation here.

GDPR-Native / EU AI Act Article 10 / EEA Operations / Consent Evidence

Brief us on your text-data project.

One business day reply. NDA on request. DPA included.

Annotation through RLHF / 150+ Languages
Or connect on LinkedIn →

Your information is never shared. We respond with the next scoping step.