Quality Measurement Guide
Most companies measure zero of the five quality dimensions that matter.

Your AI translates 10,000 segments a day. How many did anyone actually measure?

AI translation quality isn't one number. It's five dimensions, each with defined thresholds, each requiring a different escalation path. This page shows you what to measure, how to set baselines, and when to escalate.


Five signs your AI translation quality is unmanaged

These aren't arguments against AI translation. They're symptoms of deploying it without measurement. Every one of them is fixable once you have data.

Sign 01

No baseline, no trend

The AI was deployed six months ago, and no one can say whether quality improved or degraded. Without a baseline measurement, improvement is invisible and degradation is silent.

Sign 02

Spot-checking as strategy

Someone checks a random sample when they have time. Not systematically, not traceably. The industry calls this quality assurance. It's closer to hoping.

Sign 03

The fluency trap

AI translation is fluent. Grammatically correct. Sometimes completely wrong. Fluency masks errors because the output reads well even when the meaning has shifted. Human reviewers skim fluent text and miss semantic errors.

Sign 04

Nobody owns the score

Engineering deployed the model. Marketing writes the content. Operations routes it. Who owns the quality of the output? AI translation lives in the gap between departments.

Sign 05

Compliance exposure building quietly

The EU AI Act requires risk assessment for AI systems affecting people's access to information or safety. If your AI translates medical, legal, or safety content, documented quality governance may already be required.

EU AI Act in force since 2024

Five metrics that separate governed AI from ungoverned AI

Each metric measures a different failure mode. Together, they give you a complete picture of what your AI is actually producing.

1

Terminology accuracy

Are glossary terms used correctly? In pharmaceutical content, wrong terminology is a regulatory finding. In legal content, it's a liability shift.

2

Context adherence

Does the translation match the source meaning in context? The most common AI failure mode: grammatically correct translation that says something different from the source.

3

Hallucination rate

Does the AI add information not in the source? Names, dates, dosages, legal terms that appear in the translation but not in the original. Low frequency, high consequence. Entity-level hallucinations can be caught by the same automated check sketched after this list.

4

Style consistency

Does the output match the defined style guide? Brand voice, register (tu/vous, du/Sie), sentence structure. Inconsistency across markets erodes brand trust.

5

Entity preservation

Are names, numbers, dates, URLs, and product codes left intact? The simplest metric and the easiest to automate, as the sketch below shows. Still, unmanaged AI corrupts entities regularly.
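
To make that concrete, here is a minimal entity check in Python. It is a sketch, not a production checker: the regex patterns, the example strings, and the product-code format are illustrative assumptions. Entities missing from the translation indicate corruption; entities present only in the translation are candidate hallucinations (metric 3).

```python
import re

# Patterns for the entity classes named above. Illustrative only: a real
# checker needs locale-aware number rules (1,5 vs 1.5), date formats, and
# your own product-code list. The URL pattern is greedy and keeps trailing
# punctuation, which is acceptable for set comparison.
ENTITY_PATTERNS = {
    "number": r"\d+(?:[.,]\d+)*",
    "url": r"https?://\S+",
    "code": r"\b[A-Z]{2,}-\d+\b",  # e.g. DRUG-500; adjust to your codes
}

def extract_entities(text: str) -> set[str]:
    """Collect every entity-like token in a text."""
    found: set[str] = set()
    for pattern in ENTITY_PATTERNS.values():
        found.update(re.findall(pattern, text))
    return found

def entity_check(source: str, translation: str) -> dict:
    """Compare entity sets between source and translation.

    Entities missing from the translation indicate corruption; entities
    present only in the translation are candidate hallucinations.
    """
    src, tgt = extract_entities(source), extract_entities(translation)
    return {
        "missing": sorted(src - tgt),        # dropped or corrupted
        "hallucinated": sorted(tgt - src),   # added by the model
        "preserved": len(src & tgt) / len(src) if src else 1.0,
    }

print(entity_check(
    "Take 2 tablets of DRUG-500 daily. See https://example.com/leaflet.",
    "Nehmen Sie täglich 3 Tabletten DRUG-500 ein.",
))
# missing: ['2', 'https://example.com/leaflet.'], hallucinated: ['3'],
# preserved: 0.5 -> this segment fails both metric 3 and metric 5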

From zero measurement to governed quality in six weeks

Three phases. No platform required. Works on top of your existing TMS and translation workflow.

Weeks 1 to 2

Baseline audit

Score current AI output across all quality dimensions. Establish the starting point. Most companies discover their actual quality is lower than assumed.

Month 1

Threshold design

Define acceptable scores by content type. Marketing tolerates more variation than medical. Legal demands near-zero hallucination. Build escalation paths for each dimension; a threshold sketch follows these phases.

Ongoing

Continuous monitoring

Automated quality scoring on every output. Weekly trend reports. Escalations route automatically when thresholds are breached. Quality governance runs continuously because AI output varies continuously.
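
Here is a minimal Python sketch of what phases two and three produce together. The content types and threshold numbers are illustrative assumptions, not Kobalt's published limits: every scored output is checked against the limits for its content type, and any breached dimension is returned for escalation.

```python
# Illustrative thresholds per content type; the numbers are examples, not
# published limits. Scores are fractions of 1. Dimensions ending in "_max"
# are rates that must stay BELOW the limit (e.g. hallucination rate).
THRESHOLDS = {
    "marketing": {"terminology": 0.93, "context": 0.92, "hallucination_max": 0.05},
    "legal":     {"terminology": 0.98, "context": 0.97, "hallucination_max": 0.01},
    "medical":   {"terminology": 0.99, "context": 0.98, "hallucination_max": 0.005},
}

def check_output(content_type: str, scores: dict[str, float]) -> list[str]:
    """Return every breached dimension for one scored output."""
    breaches = []
    for dim, limit in THRESHOLDS[content_type].items():
        if dim.endswith("_max"):
            if scores[dim.removesuffix("_max")] > limit:
                breaches.append(dim.removesuffix("_max"))
        elif scores[dim] < limit:
            breaches.append(dim)
    return breaches

# One scored segment from the monitoring stream:
print(check_output("medical",
                   {"terminology": 0.97, "context": 0.99, "hallucination": 0.0}))
# ['terminology'] -> below the medical limit, so it escalates instead of shipping
```

The same output would pass the marketing thresholds, which is the point of phase two: the limit belongs to the content type, not to the model.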

The five metrics, measured

Published quality data to date from Kobalt's AI-enhanced localization operations. These scores are auditable and tracked continuously.

Terminology accuracy: 97%
Context adherence: 95%
Hallucination rate target: <5%
Style consistency: 94%
Entity preservation: 99%+
Revision rate: <1%

Note: terminology consistency measures approved-term usage in regulated content. It is distinct from overall terminology accuracy (97%) across all content types.

A pharmaceutical company needed absolute terminology accuracy across 8 content channels and 25+ markets. They moved from spot-check QA to a governance framework with defined thresholds per content type. Scientific claim consistency reached 99%+. The cost of quality assurance dropped because catching errors early costs less than fixing them late.

Based on continuous operations with an international pharmaceutical brand.
"AI liability will force risk-based quality models. Organizations will need defensible quality governance."
CSA Research, 2026
"Agentic translation will scale, with state and guardrails. The value moves to orchestration, quality governance, and workflow control."
CSA Research, 2026
"We don't know what results we're getting. We deployed MT but have no framework to measure quality."
Localization Director, Profile D, Client Research

Questions about AI translation quality measurement

What is AI translation quality measurement?

Systematic scoring across defined dimensions (terminology, context, hallucination, style, entity preservation). Not a single number. Not a subjective review. Measurable, traceable, repeatable.

How do you measure AI translation quality?

Each dimension gets a score: terminology accuracy is measured against client glossaries, context adherence against source intent, hallucination by detecting information not present in the source. Scores are tracked over time to identify trends.
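
For the first of those dimensions, here is a minimal glossary-scoring sketch in Python. The glossary and the segments are hypothetical, and a production scorer would lemmatize and handle inflection; this version matches surface forms only, which is exactly the caveat the example demonstrates.

```python
# Hypothetical glossary: source term -> approved target term.
GLOSSARY = {
    "adverse event": "unerwünschtes Ereignis",
    "dosage": "Dosierung",
}

def terminology_accuracy(source: str, translation: str,
                         glossary: dict[str, str]) -> float:
    """Fraction of glossary terms present in the source whose approved
    target term appears in the translation. Surface-form matching only;
    a production scorer would lemmatize and align term occurrences.
    """
    applicable = [t for t in glossary if t.lower() in source.lower()]
    if not applicable:
        return 1.0  # no glossary terms in scope for this segment
    hits = sum(glossary[t].lower() in translation.lower() for t in applicable)
    return hits / len(applicable)

score = terminology_accuracy(
    "Report any adverse event and confirm the dosage.",
    "Melden Sie jedes unerwünschte Ereignis und bestätigen Sie die Dosierung.",
    GLOSSARY,
)
print(f"{score:.0%}")  # 50%: the approved term was inflected away in German
```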

What quality score should I expect from AI translation?

Depends on content type and language pair. Well-governed AI on marketing content in major European languages typically achieves 90 to 95% across key dimensions. Unmanaged AI ranges from 70 to 85%. The gap is governance, not the model.

Is AI translation good enough for regulated content?

With governance, AI can handle some regulated content with appropriate human oversight. Without governance, no. The question isn't whether AI can translate medical or legal content. The question is whether anyone is measuring the output systematically.

What is the EU AI Act's impact on translation?

The EU AI Act requires risk assessment for AI systems affecting access to information, services, or safety. Translation systems producing medical, legal, or safety-critical content may require documented quality governance. A measurement framework provides the audit trail.

Can I measure quality with my current TMS?

A quality measurement framework is tool-agnostic. It sits on top of your existing workflow (Phrase, Lokalise, Smartling, any TMS). The framework defines what to measure and when to escalate. Your TMS handles content routing.

How long does it take to set up quality measurement?

Baseline audit: 1 to 2 weeks. Threshold design: about a month. Continuous monitoring: ongoing. Most companies have measurable quality data within 6 weeks.

What happens when AI output fails a quality threshold?

Escalation paths defined per dimension. Below-threshold terminology triggers glossary retraining. Below-threshold hallucination rates trigger human review or full stop on AI for that content type. The framework responds before the error reaches production.
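
As a sketch, the routing described above can be as simple as a dimension-to-action map. The action names and the segment ID below are hypothetical; in production each action would be a call into your TMS, not a string.

```python
# Illustrative escalation map; the actions mirror the paths described above.
ESCALATIONS = {
    "terminology": "queue_glossary_retraining",
    "hallucination": "route_to_human_review",  # or halt AI for the content type
    "context": "route_to_human_review",
    "entity": "block_release_and_flag",
}

def escalate(dimension: str, segment_id: str) -> str:
    """Pick the escalation action for a breached dimension.

    Returns a human-readable routing decision so the logic stays visible;
    a real implementation would trigger the workflow instead.
    """
    action = ESCALATIONS.get(dimension, "notify_quality_owner")
    return f"{segment_id}: {dimension} breach -> {action}"

print(escalate("terminology", "seg-0042"))
# seg-0042: terminology breach -> queue_glossary_retraining
```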

Find out what your AI is actually producing

We score your current AI translation output across all quality dimensions. You get the data. No commitment, no platform to install.

Prefer email? ricard@kobaltlanguages.com