Fugu-MT Paper Translation (Abstract): Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Paper abstract: Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

arxiv url: http://arxiv.org/abs/2510.20780v1
Date: Thu, 23 Oct 2025 17:48:36 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:18.522763
Title: Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
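Abstract summary: Large reasoning models (LRMs) can be used to evaluate machine translation (MT) quality. The paper provides the first systematic analysis of LRM-as-a-judge in MT evaluation. It then proposes calibrating LRM thinking by training on synthetic, human-like thinking trajectories.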
Reference score (independently computed attention score): 47.98620231787199
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.


Related papers