Related papers: Unlocking Reasoning Capability on Machine Translation in Large Language Models

Unlocking Reasoning Capability on Machine Translation in Large Language Models

URL: http://arxiv.org/abs/2602.14763v1
Date: Mon, 16 Feb 2026 14:05:59 GMT
Title: Unlocking Reasoning Capability on Machine Translation in Large Language Models
Authors: Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi,
Abstract summary: Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning.<n>We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark.<n>We find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Score: 57.60641851466707
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.

Related papers

Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection [0.15229257192293197]
Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources.<n>Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency.<n>This paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation.
arXiv Detail & Related papers (2026-01-27T18:37:09Z)
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered by a reflective Core Agent that invokes specialized sub-agents.<n> Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens [25.257363122413395]
"Thinking tokens" do not help LRMs better perform machine translation.<n>Fine-tuning a model with synthetic CoT explanations does not outperform standard input-output fine-tuning.<n>Our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.
arXiv Detail & Related papers (2025-10-13T20:41:01Z)
Test-Time Scaling of Reasoning Models for Machine Translation [16.317481079574065]
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding.<n>This paper investigates whether increased inference-time computation improves translation quality.
arXiv Detail & Related papers (2025-10-07T21:15:18Z)
mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning [74.97363626515236]
We propose a textbfMultilingual and Scalable Benchmark for textbfSkill-based textbfCommonsense textbfReasoning (textbfmSCoRe)<n>Our benchmark incorporates three key components that are designed to systematically evaluate LLM's reasoning capabilities.<n>Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data [64.4458540273004]
We propose a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of Large Language Models (LLMs)<n>Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions.
arXiv Detail & Related papers (2025-04-20T16:20:30Z)
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation [1.2576388595811496]
We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language.<n>We permute reasoning problems written in real languages to generate numerous question variations.<n>Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge.
arXiv Detail & Related papers (2025-03-04T19:57:47Z)
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training [66.48331530995786]
We propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context.<n>Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage.<n>Experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations.
arXiv Detail & Related papers (2025-02-25T03:03:35Z)
Context-Aware Machine Translation with Source Coreference Explanation [26.336947440529713]
We propose a model that explains the decisions made for translation by predicting coreference features in the input. We evaluate our method in the WMT document-level translation task of English-German dataset, the English-Russian dataset, and the multilingual TED talk dataset.
arXiv Detail & Related papers (2024-04-30T12:41:00Z)
Evaluating and Improving the Coreference Capabilities of Machine Translation Models [30.60934078720647]
Machine translation requires a wide range of linguistic capabilities. Current end-to-end models are expected to learn implicitly by observing aligned sentences in bilingual corpora.
arXiv Detail & Related papers (2023-02-16T18:16:09Z)
When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
proper handling of discourse significantly contributes to the quality of machine translation (MT) Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation. We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.