Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
- URL: http://arxiv.org/abs/2507.12260v2
- Date: Fri, 19 Sep 2025 15:29:20 GMT
- Title: Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
- Authors: Yikang Liu, Wanyang Zhang, Yiming Wang, Jialong Tang, Pei Zhang, Baosong Yang, Fei Huang, Rui Wang, Hai Hu
- Abstract summary: We propose the first graded measure for translationese -- the translationese-index (T-index). T-index is computed from the likelihood ratios of two contrastively fine-tuned language models (LMs).
- Score: 37.44429709909661
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Translationese refers to linguistic properties that usually occur in translated texts. Previous works study translationese by framing it as a binary classification between original texts and translated texts. In this paper, we argue that translationese should be graded instead of binary and propose the first measure for translationese -- the translationese-index (T-index), computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use synthesized translations and translations in the wild to evaluate T-index's generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index can generalize to unseen genres, authors, and language pairs. Moreover, T-index computed using two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can effectively capture translationese, as demonstrated by alignment with human pointwise ratings and pairwise judgments. Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
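The core computation lends itself to a short sketch: score a text under both LMs and take the log-likelihood ratio. Below is a minimal, hypothetical rendering with Hugging Face transformers; the checkpoint paths are placeholders, and details such as length normalization and the contrastive fine-tuning recipe follow the paper rather than this sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def total_log_likelihood(model, tokenizer, text):
    """Total log-likelihood of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token;
    # scale back up to a total over the sequence.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted

def t_index(text, lm_translated, lm_original, tokenizer):
    """Log-likelihood ratio; larger values suggest more translationese."""
    return (total_log_likelihood(lm_translated, tokenizer, text)
            - total_log_likelihood(lm_original, tokenizer, text))

# Placeholder checkpoints: the paper fine-tunes two small (0.5B) LMs
# contrastively, one on translated and one on original text.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
lm_translated = AutoModelForCausalLM.from_pretrained("path/to/lm-translated")  # placeholder
lm_original = AutoModelForCausalLM.from_pretrained("path/to/lm-original")      # placeholder

print(t_index("The meeting was, in essence, brought to its conclusion.",
              lm_translated, lm_original, tokenizer))
```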
Related papers
- Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek [0.0]
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT). We evaluate translations by three commercial LLMs of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET).
arXiv Detail & Related papers (2026-02-27T15:57:15Z) - Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework centered on a reflective Core Agent that invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, which achieves an improvement of at least 3.2 in meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z) - COMET-poly: Machine Translation Metric Grounded in Other Candidates [63.82506348745169]
We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand. We find that including a single additional translation in COMET-polycand improves segment-level metric performance.
arXiv Detail & Related papers (2025-08-25T22:55:22Z) - An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation [0.0]
We show that for ranking NMT models on chat translation, all metrics seem consistent in deciding which model outperforms the others. On the other hand, neural-based metrics outperform traditional metrics, with COMET achieving the highest correlation with human-annotated scores on chat translation.
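The meta-evaluation described here boils down to correlating metric outputs with human scores. A minimal sketch with made-up numbers (the score lists are illustrative, not data from the paper):

```python
# Segment-level meta-evaluation: correlate automatic metric scores with
# human ratings. All numbers below are invented for illustration.
from scipy.stats import pearsonr, spearmanr

human_scores  = [78, 92, 55, 88, 61]             # hypothetical human ratings
metric_scores = [0.71, 0.90, 0.48, 0.83, 0.59]   # hypothetical metric outputs

r, _ = pearsonr(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```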
arXiv Detail & Related papers (2024-12-24T05:54:40Z) - The Comparison of Translationese in Machine Translation and Human Translation in terms of Translation Relations [7.776258153133857]
The research employs two parallel corpora spanning nine genres and sharing the same source texts: one translated by NMT and the other by humans.
The results indicate that NMT relies on literal translation significantly more than human translation (HT) across genres.
arXiv Detail & Related papers (2024-03-27T19:12:20Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
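Mechanically, multi-reference evaluation with n-gram metrics means passing several parallel reference streams. A minimal sacrebleu sketch with invented sentences, illustrating only the multi-reference interface rather than the paper's method for obtaining the extra references:

```python
# Multi-reference BLEU/chrF with sacrebleu: references are given as parallel
# streams, one list per alternative reference set. Sentences are made up.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [
    ["the cat sat on the mat"],         # reference stream 1
    ["a cat was sitting on the mat"],   # reference stream 2 (alternative)
]

print(sacrebleu.corpus_bleu(hypotheses, references))
print(sacrebleu.corpus_chrf(hypotheses, references))
```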
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z) - Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely the tag refinement strategy and the tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z) - NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures [42.46681912294797]
We analyze translation-based similarity measures in the common framework of multilingual NMT.
Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification.
The measures show a relatively high correlation with human judgments.
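The paper ships an accompanying nmtscore package; a rough usage sketch from memory follows (the exact constructor arguments and defaults should be checked against the package documentation):

```python
# Translation-based text similarity via the nmtscore package; the call
# signature is recalled from the project README and may differ by version.
from nmtscore import NMTScorer

scorer = NMTScorer()  # loads a small multilingual NMT model by default
print(scorer.score("This is a sentence.", "This is a similar sentence."))
```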
arXiv Detail & Related papers (2022-04-28T17:57:17Z) - Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z) - On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z) - Revisiting Round-Trip Translation for Quality Estimation [0.0]
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references.
In this paper, we apply semantic embeddings to round-trip translation (RTT)-based QE.
Our method achieves the highest correlations with human judgments, compared to previous WMT 2019 quality estimation metric task submissions.
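The recipe is straightforward to sketch: round-trip translate the source, then compare source and round-trip in a shared embedding space. A minimal sketch, assuming some MT backend; the translate() stub and the choice of LaBSE embeddings are illustrative stand-ins, not the paper's exact setup:

```python
# RTT-based QE with semantic embeddings: translate source -> target -> source,
# then score the similarity between the original and the round-trip text.
from sentence_transformers import SentenceTransformer, util

def translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("plug in any MT system here")  # placeholder

def rtt_qe_score(source: str, src_lang: str, tgt_lang: str, embedder) -> float:
    forward = translate(source, src_lang, tgt_lang)
    round_trip = translate(forward, tgt_lang, src_lang)
    emb = embedder.encode([source, round_trip], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()  # higher => likely better MT

embedder = SentenceTransformer("sentence-transformers/LaBSE")
```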
arXiv Detail & Related papers (2020-04-29T03:20:22Z)