Fugu-MT Paper Translation (Abstract): TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Paper overview: TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

arxiv url: http://arxiv.org/abs/2605.12288v2
Date: Thu, 14 May 2026 15:18:10 GMT
Status: Translation complete
Last updated in system: 2026-05-15 18:18:46.740803
Title: TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Title (reference translation): TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Authors: Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le
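Abstract summary: The paper shows how to recover token-level optimality using only standard sequence-level pairwise comparisons. It introduces two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization.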
Reference score (independently computed attention measure): 20.353416189523006
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
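Abstract (reference translation): Direct Preference Optimization (DPO) is widely used as an RL-free method for aligning language models from pairwise preferences, but it models preferences over entire sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.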
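For orientation, the logistic/DPO loss that the abstract says TBPO generalizes can be sketched as follows. This is the standard sequence-level DPO objective, not the TBPO objective, whose exact form the abstract does not give. The function names, the toy log-probabilities, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities (hypothetical inputs).
    beta=0.1 is an assumed temperature, not a value taken from the paper.
    """
    # Log-ratio of policy to reference for the chosen and rejected responses.
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Usage: a policy identical to the reference gives a margin of 0, so the loss is log 2.
tokens_w, tokens_l = [-1.0, -2.0], [-1.5, -2.5]
print(dpo_loss(tokens_w, tokens_l, tokens_w, tokens_l))
```

Note that the per-token log-probabilities are summed before the comparison, so the preference is modeled over the full sequence. That is the property the abstract contrasts with a token-level Bradley-Terry model conditioned on the prefix.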

