Fugu-MT Paper Translation (Abstract): Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

Paper overview: Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

arxiv url: http://arxiv.org/abs/2603.10588v1
Date: Wed, 11 Mar 2026 09:45:30 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.882106
Title: Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning
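Title (reference translation): Does LLM Alignment Really Need Diversity? An Empirical Study on Applying RLVR Methods to Moral Reasoning
Authors: Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang, Zhiyuan Feng, Yaodong Yang, Xiaoyuan Yi, Xing Xie
Abstract summary: This study shows that distribution-matching approaches do not demonstrate the expected significant advantage over reward-maximizing methods on alignment tasks. The results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and that standard reward-maximizing RLVR methods can transfer effectively to moral reasoning without explicit diversity mechanisms.
Reference score (independently calculated attention level): 44.68959659268472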
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
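Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success on logical reasoning tasks, but it remains unclear whether large language model (LLM) alignment requires a fundamentally different approach. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing the two paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, distribution-matching approaches do not show the expected significant advantage over reward-maximizing methods on alignment tasks. By mapping high-reward responses into semantic space, we show that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counterintuitive finding explains why mode-seeking optimization is equally or more effective for alignment tasks. The results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and that standard reward-maximizing RLVR methods can transfer effectively to moral reasoning without explicit diversity mechanisms.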


Related papers