Fugu-MT Paper Translation (Abstract): Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data

Paper summary: Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data

arxiv url: http://arxiv.org/abs/2603.19294v1
Date: Tue, 10 Mar 2026 21:00:05 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.857688
Title: Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data
Title (reference translation): Maximizing mutual information between user contexts and responses improves LLM personalization without additional data
Authors: Hyunji Nam, Haoran Li, Natasha Jaques
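Abstract summary: This paper proposes a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random, unrelated prompt. It shows that applying Direct Preference Optimization (DPO) to this paired data maximizes the pointwise conditional mutual information (MI) between prompts and model responses. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems.
Reference score (independently computed attention score): 12.193946608981276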
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains rely heavily on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verifiable. Therefore, we need self-improvement frameworks that allow models to improve without external oversight. We propose *Mutual Information Preference Optimization (MIPO)*, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random, unrelated prompt. We show that using Direct Preference Optimization (DPO) to learn from this paired data maximizes pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Empirical results with various-sized Llama- and Qwen-Instruct models show that when used to maximize MI between user context and response, MIPO provides an effective personalization technique, achieving 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% improvements **without any additional data or human supervision**. These results suggest a promising direction for self-improvement.
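Abstract (reference translation): Post-training has successfully improved large language models (LLMs) across a variety of domains, but these gains depend heavily on human-labeled data or external verification. Existing data has already been used, and collecting new high-quality data is costly. More fundamentally, true intelligence goes beyond tasks that are easily verifiable. We therefore need self-improvement frameworks that can improve models without external oversight. The proposed method is a contrastive data augmentation technique that constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random, unrelated prompt. Applying Direct Preference Optimization (DPO) to this paired data maximizes the pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Experimental results with Llama- and Qwen-Instruct models of various sizes show that, when used to maximize MI between user context and response, MIPO provides an effective personalization technique, improving personalization tasks on real-user datasets by 3-40% compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% improvements **without any additional data or human supervision**. These results suggest a promising direction for self-improvement.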
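
The pair construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `build_contrastive_pairs`, `toy_generate`, and `dpo_loss` are hypothetical, the generator is a stand-in for sampling from an LLM, "unrelated" is approximated by drawing a different entry from the same prompt list, and `beta=0.1` is an assumed value. The loss shown is the standard per-pair DPO objective, included only to show where the chosen and rejected responses enter; for personalization, the prompt string would presumably also carry the user context.

```python
import math
import random
from typing import Callable, Dict, List


def build_contrastive_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Build one preference pair per prompt (hypothetical helper).

    chosen   = response generated from the prompt itself
    rejected = response generated from a different, randomly drawn prompt
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to draw a different one")
    pairs = []
    for i, prompt in enumerate(prompts):
        # Pick any index other than i, so the negative is conditioned
        # on another prompt from the same list.
        j = rng.choice([k for k in range(len(prompts)) if k != i])
        pairs.append(
            {
                "prompt": prompt,
                "chosen": generate(prompt),
                "rejected": generate(prompts[j]),
            }
        )
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(margin)).

    All log-probabilities are sequence log-likelihoods given the SAME
    (correct) prompt, under the trained policy and the frozen reference.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


def toy_generate(prompt: str) -> str:
    """Stand-in for sampling a response from a language model."""
    return f"response to: {prompt}"


pairs = build_contrastive_pairs(
    ["Explain DPO briefly.", "Suggest a vegan dinner.", "What is 12 * 7?"],
    toy_generate,
    random.Random(0),
)
for pair in pairs:
    print(pair)
```

No human labels or external verifiers appear anywhere in this construction: both sides of every pair come from the model itself, which matches the abstract's claim of requiring no additional data.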

Related papers