Specification-Aware Machine Translation and Evaluation for Purpose Alignment
- URL: http://arxiv.org/abs/2509.17559v1
- Date: Mon, 22 Sep 2025 10:50:37 GMT
- Title: Specification-Aware Machine Translation and Evaluation for Purpose Alignment
- Authors: Yoko Kayano, Saku Sugawara
- Abstract summary: We provide a theoretical rationale for why specifications matter in professional translation, as well as a practical guide to implementing specification-aware machine translation (MT). We compare five translation types, including official human translations and prompt-based outputs from large language models (LLMs), using expert error analysis, user preference rankings, and an automatic metric. The results show that LLM translations guided by specifications consistently outperformed official human translations in human evaluations, highlighting a gap between perceived and expected quality.
- Score: 10.50113943900077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In professional settings, translation is guided by communicative goals and client needs, often formalized as specifications. While existing evaluation frameworks acknowledge the importance of such specifications, these specifications are often treated only implicitly in machine translation (MT) research. Drawing on translation studies, we provide a theoretical rationale for why specifications matter in professional translation, as well as a practical guide to implementing specification-aware MT and evaluation. Building on this foundation, we apply our framework to the translation of investor relations texts from 33 publicly listed companies. In our experiment, we compare five translation types, including official human translations and prompt-based outputs from large language models (LLMs), using expert error analysis, user preference rankings, and an automatic metric. The results show that LLM translations guided by specifications consistently outperformed official human translations in human evaluations, highlighting a gap between perceived and expected quality. These findings demonstrate that integrating specifications into MT workflows, with human oversight, can improve translation quality in ways aligned with professional practice.
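To make the specification-aware setup concrete, here is a minimal sketch of how a translation brief might be embedded in an LLM prompt. This is an illustration under assumptions: the field names and example values below are invented and are not the paper's actual prompts or data.

```python
# Minimal sketch of specification-aware prompting. Assumption: the fields
# follow common translation-brief categories (audience, purpose, register,
# terminology); the paper's real prompts are not reproduced here.
from dataclasses import dataclass, field

@dataclass
class TranslationSpec:
    audience: str                 # who reads the translation
    purpose: str                  # what the translation should accomplish
    register: str                 # expected tone and style
    terminology: dict[str, str] = field(default_factory=dict)  # enforced term pairs

def build_prompt(spec: TranslationSpec, source_text: str) -> str:
    """Embed the specification directly into the translation instruction."""
    terms = "\n".join(f"- {src} -> {tgt}" for src, tgt in spec.terminology.items())
    return (
        "Translate the following Japanese investor relations text into English.\n"
        f"Audience: {spec.audience}\n"
        f"Purpose: {spec.purpose}\n"
        f"Register: {spec.register}\n"
        f"Required terminology:\n{terms}\n\n"
        f"Source text:\n{source_text}"
    )

spec = TranslationSpec(
    audience="institutional investors",
    purpose="support earnings analysis",
    register="formal, concise business English",
    terminology={"営業利益": "operating profit"},
)
print(build_prompt(spec, "当社の営業利益は前年同期比10%増加しました。"))
```

The resulting prompt is what a specification-guided LLM run would receive; in the paper's experiment, outputs produced under such guidance were the ones that outperformed official human translations in human evaluation.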
Related papers
- Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered on a reflective Core Agent that invokes specialized sub-agents.
Experimental results indicate the efficacy of RATE, which achieves an improvement of at least 3.2 points in meta score over current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
- DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation [31.1561882673283]
DITING is the first comprehensive evaluation framework for web novel translation.
AgentEval simulates expert deliberation to assess translation quality beyond lexical overlap.
We develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores.
arXiv Detail & Related papers (2025-10-10T08:10:10Z)
- Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation [70.43884512651668]
We formalize Genette's (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for machine translation.
We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai.
Our findings demonstrate the potential of paratextual explicitation in advancing machine translation beyond linguistic equivalence.
arXiv Detail & Related papers (2025-09-27T16:27:36Z)
- LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering [21.28047224832753]
LiTransProQA is a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation.
It integrates insights from professional literary translators and researchers, focusing on literary devices, cultural understanding, and authorial voice.
LiTransProQA substantially outperforms current metrics, achieving up to a 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments.
arXiv Detail & Related papers (2025-05-08T17:12:56Z)
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality.
We show that TREQA is competitive with, and in some cases outperforms, state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations (a toy sketch of the QA-scoring idea appears after this list).
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts [56.7988577327046]
We introduce TransAgents, a novel multi-agent framework that simulates the roles and collaborative practices of a human translation company.
Our findings highlight the potential of multi-agent collaboration in enhancing translation quality, particularly for longer texts.
arXiv Detail & Related papers (2024-05-20T05:55:08Z)
- Optimizing Machine Translation through Prompt Engineering: An Investigation into ChatGPT's Customizability [0.0]
The study reveals that the inclusion of suitable prompts in large-scale language models like ChatGPT can yield flexible translations.
The research scrutinizes the changes in translation quality when prompts are used to generate translations that meet specific conditions.
arXiv Detail & Related papers (2023-08-02T19:11:04Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear whether automatic metrics can reliably distinguish good translations from bad ones at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes (a correlation sketch appears after this list).
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- A Bayesian approach to translators' reliability assessment [0.0]
We treat the Translation Quality Assessment (TQA) process as a complex process, viewed from the standpoint of the physics of complex systems.
We build two Bayesian models that parameterise the features involved in TQA: the difficulty of the translation and the characteristics of the translators involved in producing the translation and assessing its quality.
We show that reviewers' reliability cannot be taken for granted, even when they are expert translators (a toy Beta-Binomial sketch appears after this list).
arXiv Detail & Related papers (2022-03-14T14:29:45Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work sets out to correctly estimate confidence intervals (Brown et al., 2001) as a function of the sample size of the translated text.
The methodology applied is Bernoulli Statistical Distribution Modelling (BSDM) together with Monte Carlo Sampling Analysis (MCSA); a small sampling sketch appears after this list.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
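As a companion to the TREQA entry above, the following toy sketch shows the QA-scoring idea: a candidate translation is scored by how many expected answers it still makes recoverable. The real framework uses an LLM to generate and answer questions; here both steps are reduced to string matching purely for illustration.

```python
# Toy QA-based translation scoring in the spirit of TREQA. Assumption:
# qa_pairs would normally be generated by an LLM from the source or
# reference; the 'reader' here is naive substring matching.
def answer_found(answer: str, translation: str) -> bool:
    """Toy reader: is the expected answer recoverable verbatim?"""
    return answer.lower() in translation.lower()

def qa_score(qa_pairs: list[tuple[str, str]], translation: str) -> float:
    """Fraction of questions whose expected answers the translation preserves."""
    hits = sum(answer_found(ans, translation) for _, ans in qa_pairs)
    return hits / len(qa_pairs)

qa_pairs = [
    ("What grew by 10%?", "operating profit"),
    ("Relative to which period?", "the previous year"),
]
candidate = "Operating profit rose 10% from the previous year."
print(qa_score(qa_pairs, candidate))  # 1.0: both answers are recoverable
```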
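For the extrinsic metric-evaluation entry above, this sketch shows the kind of segment-level check involved: correlating per-segment metric scores with downstream task outcomes. All numbers are invented; in the paper's experiments, such correlations came out negligible.

```python
# Hypothetical segment-level extrinsic check: do higher metric scores
# (e.g. COMET) predict success on a downstream cross-lingual task?
from scipy.stats import kendalltau

metric_scores = [0.81, 0.62, 0.90, 0.55, 0.74, 0.68]  # invented per-segment scores
task_success  = [1, 0, 1, 1, 0, 0]                    # invented downstream outcomes

tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
# A tau near zero, as here, would mirror the paper's negative finding.
```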
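For the Bayesian reliability entry above, this toy Beta-Binomial sketch illustrates the reliability point: with a limited number of judgments, the posterior over a reviewer's agreement rate stays wide. The paper's actual models are richer, jointly parameterising translation difficulty and translator characteristics; the counts below are invented.

```python
# Toy Beta-Binomial posterior for a reviewer's agreement rate with a gold
# standard. Assumption: invented counts and a uniform Beta(1, 1) prior;
# the paper's Bayesian models are considerably more structured.
from scipy.stats import beta

agreements, disagreements = 18, 12          # hypothetical review outcomes
posterior = beta(1 + agreements, 1 + disagreements)

lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"posterior mean reliability = {posterior.mean():.2f}")
print(f"95% credible interval      = [{lo:.2f}, {hi:.2f}]")
# Even a seemingly expert reviewer leaves a wide interval at small n,
# echoing the finding that reviewer reliability cannot be assumed.
```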
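Finally, for the uncertainty-measurement entry above, this sketch shows how sample size drives confidence-interval width for a Bernoulli error rate, in the spirit of the BSDM/MCSA methodology the summary names. The true error rate is an assumed toy value.

```python
# Monte Carlo view of interval width vs. sample size for a Bernoulli
# error rate. Assumption: a toy true error rate of 0.1; the cited work
# builds on binomial interval estimators (Brown et al., 2001).
import numpy as np

rng = np.random.default_rng(0)
true_error_rate = 0.1

for n in (50, 200, 1000):  # number of translated segments checked
    # Replicate the review 10,000 times and observe the estimator's spread.
    estimates = rng.binomial(n, true_error_rate, size=10_000) / n
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"n={n:4d}: 95% of error-rate estimates fall in [{lo:.3f}, {hi:.3f}]")
# Larger samples tighten the interval, quantifying how much text must be
# sampled before a quality estimate can be trusted.
```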