AI translation quality isn't one number. It's five dimensions, each with defined thresholds, each requiring a different escalation path. This page shows you what to measure, how to set baselines, and when to escalate.
These aren't arguments against AI translation. They're symptoms of deploying it without measurement. Every one of them is fixable once you have data.
Deployed 6 months ago, can't say if quality improved or degraded. Without a baseline measurement, improvement is invisible and degradation is silent.
Someone checks a random sample when they have time. Not systematically, not traceably. The industry calls this quality assurance. It's closer to hoping.
AI translation is fluent. Grammatically correct. Sometimes completely wrong. Fluency masks errors because the output reads well even when the meaning has shifted. Human reviewers skim fluent text and miss semantic errors.
Engineering deployed the model. Marketing writes the content. Operations routes it. Who owns the quality of the output? AI translation lives in the gap between departments.
The EU AI Act requires risk assessment for AI systems affecting people's access to information or safety. If your AI translates medical, legal, or safety content, documented quality governance may already be required.
EU AI Act in force since 2024

Each metric measures a different failure mode. Together, they give you a complete picture of what your AI is actually producing.
Are glossary terms used correctly? In pharmaceutical content, wrong terminology is a regulatory finding. In legal content, it's a liability shift.
Does the translation match the source meaning in context? The most common AI failure mode: grammatically correct translation that says something different from the source.
Does the AI add information not in the source? Names, dates, dosages, legal terms that appear in translation but not in the original. Low frequency, high consequence.
Does the output match the defined style guide? Brand voice, register (tu/vous, du/Sie), sentence structure. Inconsistency across markets erodes brand trust.
Are names, numbers, dates, URLs, and product codes left intact? The simplest metric and the easiest to automate. Still, unmanaged AI corrupts entities regularly.
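Entity preservation is the easiest dimension to automate because it needs no linguistic judgment: extract the entities from the source, extract them from the translation, and diff. The sketch below is a minimal illustration of that idea, not a production QA tool; the regex patterns, function names, and sample strings are all assumptions for the example.

```python
import re

# Patterns for entities that must survive translation unchanged.
# Illustrative, not exhaustive.
ENTITY_PATTERNS = {
    "number": r"\d+(?:[.,]\d+)*",
    "url": r"https?://\S+",
    "code": r"\b[A-Z]{2,}-\d+\b",  # e.g. a product code like "REF-1042"
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Collect every entity occurrence in the text, keyed by entity type."""
    return {name: re.findall(pat, text) for name, pat in ENTITY_PATTERNS.items()}

def entity_preservation_issues(source: str, target: str) -> list[str]:
    """Return entities present in the source but missing from the target."""
    src, tgt = extract_entities(source), extract_entities(target)
    issues = []
    for name, values in src.items():
        missing = [v for v in values if v not in tgt[name]]
        issues.extend(f"{name}: {v}" for v in missing)
    return issues

issues = entity_preservation_issues(
    "Take 20 mg daily. Details: https://example.com/REF-1042",
    "Tomar 25 mg al día. Detalles: https://example.com/REF-1042",
)
print(issues)  # → ['number: 20'] — the altered dosage is flagged
```

A check this cheap can run on every single output, which is exactly why "still, unmanaged AI corrupts entities regularly" is an avoidable failure.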
Three phases. No platform required. Works on top of your existing TMS and translation workflow.
Score current AI output across all quality dimensions. Establish the starting point. Most companies discover their actual quality is lower than assumed.
Define acceptable scores by content type. Marketing tolerates more variation than medical. Legal demands near-zero hallucination. Build escalation paths for each dimension.
Automated quality scoring on every output. Weekly trend reports. Escalations route automatically when thresholds are breached. Quality governance runs continuously because AI output varies continuously.
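Per-content-type thresholds with per-dimension escalations can live in plain configuration. The sketch below shows the routing logic in miniature; the dimension names, threshold values, and escalation actions are illustrative assumptions, not Kobalt's actual settings.

```python
# Per-content-type minimum scores (0-100). Illustrative values:
# medical tolerates almost no hallucination; marketing is more forgiving.
THRESHOLDS = {
    "medical":   {"terminology": 98, "context": 95, "hallucination_free": 99},
    "marketing": {"terminology": 90, "context": 88, "hallucination_free": 95},
}

# Action to trigger when a dimension falls below its threshold.
ESCALATIONS = {
    "terminology": "retrain glossary / route to terminologist",
    "context": "route to human review",
    "hallucination_free": "halt AI for this content type, full human review",
}

def escalate(content_type: str, scores: dict[str, float]) -> list[str]:
    """Return the escalation actions triggered by below-threshold scores."""
    limits = THRESHOLDS[content_type]
    return [
        ESCALATIONS[dim]
        for dim, minimum in limits.items()
        if scores.get(dim, 0) < minimum
    ]

actions = escalate(
    "medical",
    {"terminology": 99, "context": 92, "hallucination_free": 99.5},
)
print(actions)  # context (92) is below the medical floor (95) → human review
```

The point of the structure is that the same scored output routes differently per content type: a context score of 92 passes for marketing but escalates for medical.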
Published quality data from Kobalt's AI-enhanced localization operations to date. These scores are auditable and tracked continuously.
*Terminology consistency measures approved-term usage in regulated content. Distinct from overall terminology accuracy (97%) across all content types.
A pharmaceutical company needed absolute terminology accuracy across 8 content channels and 25+ markets. They moved from spot-check QA to a governance framework with defined thresholds per content type. Scientific claim consistency reached 99%+. The cost of quality assurance dropped because catching errors early costs less than fixing them late.
"AI liability will force risk-based quality models. Organizations will need defensible quality governance."
CSA Research, 2026
"Agentic translation will scale, with state and guardrails. The value moves to orchestration, quality governance, and workflow control."
CSA Research, 2026
"We don't know what results we're getting. We deployed MT but have no framework to measure quality."
Localization Director, Profile D, Client Research
Systematic scoring across defined dimensions (terminology, context, hallucination, style, entity preservation). Not a single number. Not a subjective review. Measurable, traceable, repeatable.
Each dimension gets a score: terminology accuracy is measured against client glossaries, context adherence against source intent, hallucination by detecting information not present in the source. Scores are tracked over time to identify trends.
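Tracking a dimension over time can be as simple as a rolling mean of recent scores compared against a floor. A minimal sketch of that trend check follows; the class name, window size, and sample weekly scores are assumptions for illustration.

```python
from collections import deque

class DimensionTrend:
    """Rolling mean of recent scores for one quality dimension."""

    def __init__(self, window: int = 4):
        # Keep only the last `window` scores; older ones age out.
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> None:
        self.scores.append(score)

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def breached(self, floor: float) -> bool:
        """Flag the dimension when its rolling mean drops below the floor."""
        return self.mean() < floor

terminology = DimensionTrend(window=4)
for weekly_score in (97, 96, 93, 91):   # four weeks of weekly scores
    terminology.add(weekly_score)
print(terminology.mean())               # 94.25
print(terminology.breached(floor=95))   # True: the trend crossed the floor
```

Comparing the rolling mean rather than a single week's score keeps one noisy sample from triggering an escalation while still catching a genuine slide.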
Depends on content type and language pair. Well-governed AI on marketing content in major European languages typically achieves 90 to 95% across key dimensions. Unmanaged AI ranges from 70 to 85%. The gap is governance, not the model.
With governance, AI can handle some regulated content with appropriate human oversight. Without governance, no. The question isn't whether AI can translate medical or legal content. The question is whether anyone is measuring the output systematically.
The EU AI Act requires risk assessment for AI systems affecting access to information, services, or safety. Translation systems producing medical, legal, or safety-critical content may require documented quality governance. A measurement framework provides the audit trail.
A quality measurement framework is tool-agnostic. It sits on top of your existing workflow (Phrase, Lokalise, Smartling, any TMS). The framework defines what to measure and when to escalate. Your TMS handles content routing.
Baseline audit: 1 to 2 weeks. Threshold design: about a month. Continuous monitoring: ongoing. Most companies have measurable quality data within 6 weeks.
Escalation paths defined per dimension. Below-threshold terminology triggers glossary retraining. Below-threshold hallucination rates trigger human review or full stop on AI for that content type. The framework responds before the error reaches production.
We score your current AI translation output across all quality dimensions. You get the data. No commitment, no platform to install.
Prefer email? ricard@kobaltlanguages.com