Fugu-MT Paper Translation (Abstract): One Model for All: Multi-Objective Controllable Language Models

Paper overview: One Model for All: Multi-Objective Controllable Language Models

arxiv url: http://arxiv.org/abs/2604.04497v1
Date: Mon, 06 Apr 2026 07:48:32 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.137099
Title: One Model for All: Multi-Objective Controllable Language Models
Title (reference translation): One Model: Multi-Objective Controllable Language Models
Authors: Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy, Setareh Maghsudi
Abstract summary: We introduce Multi-Objective Control (MOC), which brings multi-objective optimization (MOO) principles into RLHF to train a single LLM as a preference-conditioned policy network. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects.
Reference score (independently computed attention score): 65.4626816393381
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.
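Abstract (reference translation): Aligning large language models (LLMs) with human preferences is essential for improving LLMs' safety, helpfulness, humor, faithfulness, and so on. Current reinforcement learning from human feedback (RLHF) focuses mainly on a fixed reward learned from average human ratings, which may weaken adaptability and controllability across varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial because data per user is scarce and user preferences in multi-objective trade-offs are diverse, ranging from emphasizing empathy in some contexts to demanding efficiency and precision in others. Can one LLM be trained to produce personalized outputs across different user preferences on the Pareto front? This paper introduces Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. The approach brings multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. Applying MOO at the policy level improves the computational efficiency of MOC and makes it possible to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs with respect to user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of the multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications that require scalable and customizable LLMs.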
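
The abstract measures quality and diversity by the hyper-volume of the solutions achieved. As a rough illustration of that metric only, the sketch below computes the hyper-volume for two rewards that are both maximized. The function names, the sample reward pairs, and the reference point are assumptions made for this example; this is not the paper's evaluation code.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (reward_1, reward_2), both maximized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, sorted ascending."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points: List[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the Pareto front and bounded below by `ref`."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    # Along a 2D front, sorting by the first reward in descending order
    # makes the second reward ascend, so the area splits into strips.
    front.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


# Hypothetical reward pairs, e.g. one per preference setting.
solutions = [(3, 1), (2, 2), (1, 3), (1, 1)]
print(pareto_front(solutions))
print(hypervolume_2d(solutions))
```

A larger hyper-volume means the set of solutions covers more of the reward space above the reference point, so the value rises both when individual solutions improve and when they spread out along the front.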

Related papers