Fugu-MT Paper Translation (Abstract): Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Paper overview: Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

arxiv url: http://arxiv.org/abs/2603.13045v1
Date: Fri, 13 Mar 2026 14:52:51 GMT
Status: Translation complete
System update date: 2026-03-16 17:38:12.140231
Title: Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation
Title (reference translation): Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation
Authors: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
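Abstract summary: Existing post-training methods rely heavily on high-quality parallel data. We introduce WALAR, a reinforcement learning method that uses only monolingual text. We develop techniques including word alignment and language alignment to mitigate failure modes ("holes") in WALAR's reward for RL training.
Reference score (independently computed attention score): 9.906839381314082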
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.
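Abstract (reference translation): Large language models (LLMs) have shown remarkable capability in machine translation on high-resource language pairs, but their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which is often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement learning method that uses only monolingual text. Our key insight is based on observing failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) with these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM that translates 101 languages using WALAR. Experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.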
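
Illustrative sketch (not from the paper): the abstract does not say how word alignment and language alignment enter WALAR's reward. The toy function below shows one possible way a source-based QE score could be guarded against two common reward-hacking outputs, namely a translation in the wrong language and a translation that leaves most source words unaligned. The function name `gated_reward`, its parameters, and the `min_coverage` threshold are all hypothetical, and the QE score, detected language, and alignment counts are assumed to come from external tools that are not shown here.

```python
def gated_reward(
    qe_score: float,
    detected_lang: str,
    target_lang: str,
    aligned_source_words: int,
    total_source_words: int,
    min_coverage: float = 0.5,  # hypothetical threshold, not from the paper
) -> float:
    """Toy reward: a QE score gated by language match and alignment coverage.

    This is an assumption-laden illustration, not the WALAR reward itself.
    """
    # Language gate: an output in the wrong language earns nothing,
    # even if the QE model scores it highly.
    if detected_lang != target_lang:
        return 0.0

    # Guard against empty sources to avoid division by zero.
    if total_source_words <= 0:
        return 0.0

    # Alignment gate: fraction of source words aligned to some target word.
    coverage = aligned_source_words / total_source_words
    if coverage < min_coverage:
        return 0.0

    # Otherwise scale the QE score by how much of the source is covered.
    return qe_score * coverage


if __name__ == "__main__":
    # Wrong-language output is zeroed despite a high QE score.
    print(gated_reward(0.9, "fr", "de", 5, 5))
    # Correct language, fully aligned output keeps its QE score.
    print(gated_reward(0.8, "de", "de", 4, 4))
```

A multiplicative gate is used here only because it is easy to read; an additive penalty or a learned combination would be equally plausible readings of the abstract.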

Related papers