Test-Time Scaling of Reasoning Models for Machine Translation
- URL: http://arxiv.org/abs/2510.06471v1
- Date: Tue, 07 Oct 2025 21:15:18 GMT
- Title: Test-Time Scaling of Reasoning Models for Machine Translation
- Authors: Zihao Li, Shaoxiong Ji, Jörg Tiedemann
- Abstract summary: Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding. This paper investigates whether increased inference-time computation improves translation quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model's reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
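The post-editing scenario the abstract highlights, spending extra inference-time compute on multi-step self-correction rather than on a longer single pass, can be sketched as a simple loop. This is an illustrative sketch only: `translate` and `post_edit` are hypothetical placeholders standing in for calls to a reasoning model, not the paper's actual API.

```python
def translate(source: str) -> str:
    # Placeholder: a single-pass draft translation from a reasoning model.
    return f"<draft translation of: {source}>"


def post_edit(source: str, draft: str) -> str:
    # Placeholder: the model re-reads the source and its own draft,
    # then emits a revised translation (one self-correction step).
    return f"<revision of: {draft}>"


def translate_with_self_correction(source: str, rounds: int = 2) -> str:
    """Draft once, then spend additional test-time compute on post-editing.

    Each round is one extra unit of inference-time computation; the paper's
    finding is that this use of TTS is more reliable than scaling the
    single-pass reasoning of a general-purpose model.
    """
    hypothesis = translate(source)
    for _ in range(rounds):
        hypothesis = post_edit(source, hypothesis)
    return hypothesis
```

Under this framing, the three evaluated scenarios differ only in where compute is spent: `rounds=0` corresponds to direct translation, while increasing `rounds` corresponds to the multi-step self-correction workflow.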
Related papers
- Unlocking Reasoning Capability on Machine Translation in Large Language Models [57.60641851466707]
Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. We systematically evaluate several open- and closed-weight RLMs on the WMT24++ benchmark. We find that enabling explicit reasoning consistently degrades translation quality across languages and models.
arXiv Detail & Related papers (2026-02-16T14:05:59Z) - Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs [69.28193153685893]
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models.
arXiv Detail & Related papers (2026-01-16T06:29:07Z) - Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered on a reflective Core Agent that invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z) - Limits and Gains of Test-Time Scaling in Vision-Language Reasoning [8.76012279865596]
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference. We present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source Vision-Language Models (VLMs) on different benchmarks.
arXiv Detail & Related papers (2025-12-11T20:48:54Z) - Think Right, Not More: Test-Time Scaling for Numerical Claim Verification [14.07771397213171]
We show that test-time compute can be used to verify complex numerical claims. We introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods.
arXiv Detail & Related papers (2025-09-26T09:23:35Z) - TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance [5.306276499628096]
Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected tools. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. It combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment.
arXiv Detail & Related papers (2025-06-23T06:38:49Z) - Context-Aware Machine Translation with Source Coreference Explanation [26.336947440529713]
We propose a model that explains the decisions made for translation by predicting coreference features in the input.
We evaluate our method on the English-German and English-Russian WMT document-level translation datasets and the multilingual TED talk dataset.
arXiv Detail & Related papers (2024-04-30T12:41:00Z) - Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on multilingual models rather than the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z) - Non-Autoregressive Document-Level Machine Translation [35.48195990457836]
Non-autoregressive translation (NAT) models achieve comparable performance and superior speed compared to auto-regressive translation (AT) models.
However, their abilities are unexplored in document-level machine translation (MT).
We propose a simple but effective design of sentence alignment between source and target.
arXiv Detail & Related papers (2023-05-22T09:59:59Z) - Candidate Soups: Fusing Candidate Results Improves Translation Quality for Non-Autoregressive Translation [15.332496335303189]
Non-autoregressive translation (NAT) models achieve a much faster inference speed than autoregressive translation (AT) models.
Existing NAT methods focus only on improving the NAT model's performance but do not fully utilize the candidate translations it produces.
We propose a simple but effective method called "Candidate Soups," which can obtain high-quality translations.
arXiv Detail & Related papers (2023-01-27T02:39:42Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.