Fugu-MT Paper Translation (Abstract): Token-weighted Direct Preference Optimization with Attention

Paper summary: Token-weighted Direct Preference Optimization with Attention

arxiv url: http://arxiv.org/abs/2605.21883v2
Date: Tue, 26 May 2026 03:18:58 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:40.879161
Title: Token-weighted Direct Preference Optimization with Attention
Title (reference translation): Token-Weighted Direct Preference Optimization Using Attention
Authors: Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie
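Abstract summary: This paper proposes Token-weighted DPO (TwDPO), a new training objective grounded in token-weighted RL, and AttentionPO, an instantiation of TwDPO that estimates token weights using attention from the LLM itself. Experiments show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard.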
Reference score (independently computed interest score): 17.569206072311157
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in a response equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute token weights using either token-position-based heuristic functions, which lack robustness, or probability estimates given by a separately trained model, which incur extra training cost. In contrast, we propose Token-weighted DPO (TwDPO), a novel training objective grounded in token-weighted RL, and AttentionPO, an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and checks where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experimental results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.
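Abstract (reference translation): Direct Preference Optimization (DPO) aligns large language models with human preferences without a separate reward model. However, DPO treats every token in a response equally and ignores the differing importance of individual tokens. Existing token-level PO methods compute token weights using either heuristic functions based on token position or probability estimates from a separately trained model; these approaches lack robustness and add training cost. In contrast, the authors propose Token-weighted DPO (TwDPO), a new training objective grounded in token-weighted RL, and AttentionPO, an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to act as a pairwise judge and checks where the model attends when it compares the responses. This design makes AttentionPO content-aware, since weights are adjusted according to response content, and efficient, since it requires only two extra forward passes per example. Experimental results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing preference optimization methods.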
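The abstract does not give the TwDPO objective or say how attention is aggregated into token weights, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes (1) each response token already has a non-negative attention mass collected from the judge-style forward pass, (2) those masses are rescaled so the weights average to 1, and (3) the weights multiply the per-token log-probability ratios inside the usual DPO logistic loss. All function names and the `beta` default are hypothetical.

```python
import math


def attention_to_weights(attn_mass):
    """Rescale per-token attention mass so the weights average to 1.

    Assumption: attn_mass[t] is the attention a judge prompt pays to
    response token t, already aggregated over layers and heads.
    """
    n = len(attn_mass)
    total = sum(attn_mass)
    if n == 0 or total <= 0:
        return [1.0] * n  # fall back to uniform weights (plain DPO)
    return [n * a / total for a in attn_mass]


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Sum over tokens of w_t * (log pi(y_t) - log pi_ref(y_t))."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """-log sigmoid(beta * (weighted ratio of chosen - weighted ratio of rejected)).

    chosen and rejected are (policy_logps, ref_logps, weights) triples.
    This weighting scheme is an assumption made for illustration.
    """
    margin = beta * (weighted_log_ratio(*chosen) - weighted_log_ratio(*rejected))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)) = -log sigmoid(margin).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

With all weights equal to 1, the weighted sum is the ordinary sequence-level log-probability ratio, so the sketch reduces to the standard DPO loss. Non-uniform weights let tokens that receive more attention contribute more to the preference margin.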
