Fugu-MT Paper Translation (Abstract): Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

Paper overview: Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

arxiv url: http://arxiv.org/abs/2509.25543v1
Date: Mon, 29 Sep 2025 22:03:11 GMT
Status: Translation complete
System update date: 2025-10-01 17:09:04.347099
Title: Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model
Title (reference translation): Alignment of Multilingual Reasoning through Verifiable Semantics from a High-Resource Expert Model
Authors: Fahim Faisal, Kaiqiang Song, Song Wang, Simin Ma, Shujian Liu, Haoyun Deng, Sathish Reddy Indurthi
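Abstract summary: This paper introduces Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards. The framework enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. The proposed method is shown to significantly narrow the performance gap between English and other languages.
Reference score (independently computed attention metric): 13.788758077632432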
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
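Abstract (reference translation): While reinforcement learning has improved the reasoning abilities of Large Language Models (LLMs), these gains are largely limited to English, producing a large performance disparity across languages. To address this, we introduce PB-RLSVR (Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards), a new framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. The approach uses a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. The multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We examine several cross-lingual semantic reward functions, including ones based on embeddings and machine translation. Extensive experiments on multilingual reasoning benchmarks show that the method significantly narrows the performance gap between English and other languages and substantially outperforms traditional PPO baselines. Specifically, the PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, showing a strong and data-efficient approach to building truly multilingual reasoning agents.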
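
The abstract mentions embedding-based cross-lingual reward functions without giving their form. As a rough illustration only, the sketch below scores a response by the cosine similarity between its embedding and the embedding of the English pivot reference, clipped to [0, 1]. The function names, the clipping, and the toy vectors are assumptions made for this example and are not taken from the paper; in practice the vectors would come from some multilingual sentence encoder, which is not modeled here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as carrying no signal
    return dot / (norm_a * norm_b)


def semantic_reward(response_emb: Sequence[float],
                    reference_emb: Sequence[float]) -> float:
    """Hypothetical reward: cosine similarity clipped to [0, 1].

    Assumption for illustration, not the paper's definition.
    """
    return max(0.0, min(1.0, cosine_similarity(response_emb, reference_emb)))


# Toy vectors standing in for multilingual sentence embeddings (assumed data).
reference = [0.9, 0.1, 0.0]           # English pivot reference
close_response = [0.8, 0.2, 0.1]      # target-language answer, similar meaning
unrelated_response = [0.0, 0.1, 0.9]  # target-language answer, different meaning

print(semantic_reward(close_response, reference))
print(semantic_reward(unrelated_response, reference))
```

Under these assumptions, a response whose embedding lies close to the reference receives a reward near 1, while an unrelated response receives a reward near 0. A reinforcement learning loop could then use this scalar in place of a human-annotated target-language label.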
