Fugu-MT Paper Translation (Abstract): Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

Paper abstract: Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

arxiv url: http://arxiv.org/abs/2605.11679v2
Date: Wed, 13 May 2026 09:28:34 GMT
Status: Translation complete
System update date: 2026-05-14 17:13:58.892011
Title: Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
Authors: ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng
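Abstract summary: In multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. The proposed MORA (Multi-Objective Reward Assimilation) isolates single-reward prompts and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents.
Reference score (independently computed attention score): 75.37308041820552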
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions; and (2) in simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our code is available at https://github.com/Shiying-Huang/MORA-MPA.
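The pre-sampling and rewriting steps described in the abstract might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`is_single_reward_prompt`, `expand_prompts`), the score threshold, and the criterion "no sampled rollout scores high on two or more dimensions at once" are all assumptions introduced here, and the `sample_scores` and `rewrite` callables stand in for a reward-model scorer and an LLM rewriter that the abstract does not specify.

```python
from typing import Callable, Dict, List, Sequence

# Reward dimensions named in the abstract.
DIMENSIONS = ("helpful", "harmless", "truthful")


def count_high_dims(scores: Dict[str, float], threshold: float) -> int:
    """Number of dimensions on which one rollout scores at or above the threshold."""
    return sum(1 for d in DIMENSIONS if scores[d] >= threshold)


def is_single_reward_prompt(
    rollout_scores: Sequence[Dict[str, float]], threshold: float = 0.5
) -> bool:
    """Hypothetical criterion: the prompt is 'single-reward' if none of its
    pre-sampled rollouts is high on two or more dimensions simultaneously."""
    best = max((count_high_dims(s, threshold) for s in rollout_scores), default=0)
    return best <= 1


def expand_prompts(
    prompts: Sequence[str],
    sample_scores: Callable[[str], Sequence[Dict[str, float]]],  # assumed scorer
    rewrite: Callable[[str], str],  # assumed rewriter (e.g. an LLM call)
    threshold: float = 0.5,
) -> List[str]:
    """Rewrite only the prompts flagged as single-reward; keep the rest unchanged."""
    result = []
    for prompt in prompts:
        scores = sample_scores(prompt)
        if is_single_reward_prompt(scores, threshold):
            result.append(rewrite(prompt))
        else:
            result.append(prompt)
    return result


# Toy usage with hand-written scores in place of real rollouts.
toy_scores = {
    "a": [
        {"helpful": 0.9, "harmless": 0.1, "truthful": 0.2},
        {"helpful": 0.2, "harmless": 0.8, "truthful": 0.1},
    ],
    "b": [{"helpful": 0.9, "harmless": 0.7, "truthful": 0.2}],
}
expanded = expand_prompts(
    ["a", "b"], lambda p: toy_scores[p], lambda p: p + " [rewritten]"
)
```

In this toy run, prompt "a" is flagged because each of its rollouts is high on only one dimension, while prompt "b" is left alone because one rollout is high on two dimensions at once. How MORA actually defines and detects single-reward prompts should be taken from the paper and its repository.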
