Fugu-MT Paper Translation (Abstract): Learning to Rank Caption Chains for Video-Text Alignment

Paper overview: Learning to Rank Caption Chains for Video-Text Alignment

arxiv url: http://arxiv.org/abs/2603.25145v1
Date: Thu, 26 Mar 2026 08:04:57 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.174074
Title: Learning to Rank Caption Chains for Video-Text Alignment
Authors: Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler
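Abstract summary: Direct preference optimization (DPO) is an effective technique for training language models to generate preferred rather than dispreferred responses. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. This work investigates ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs.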
Reference score (independently computed attention score): 6.779243901781581
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
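The contrast the abstract draws between binary Bradley-Terry DPO and ranking over an ordered caption chain can be illustrated with a small sketch. The abstract does not state which ranking objective the authors use, so the Plackett-Luce listwise loss below is an assumption chosen for illustration, and every function name, the `beta` value, and the example numbers are hypothetical. Rewards are the usual DPO implicit rewards, `beta * (log pi(y|x) - log pi_ref(y|x))`, listed from the original caption to the most degraded one.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    return beta * (logp_policy - logp_ref)


def dpo_pairwise_loss(reward_win: float, reward_lose: float) -> float:
    """Binary Bradley-Terry DPO loss: -log sigmoid(r_win - r_lose)."""
    return -log_sigmoid(reward_win - reward_lose)


def plackett_luce_loss(rewards_best_to_worst: list[float]) -> float:
    """Listwise ranking loss over a totally ordered chain (assumed objective).

    Computes -sum_k log( exp(r_k) / sum_{j >= k} exp(r_j) ), where index 0 is
    the best caption and the last index is the most degraded caption.
    """
    rewards = rewards_best_to_worst
    loss = 0.0
    for k in range(len(rewards) - 1):
        tail = rewards[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in tail))
        loss -= rewards[k] - log_z
    return loss


# Hypothetical log-probabilities for a chain of three captions:
# the original caption, one degradation step, two degradation steps.
policy_logps = [-40.0, -48.0, -60.0]
ref_logps = [-45.0, -47.0, -52.0]
chain_rewards = [implicit_reward(p, r) for p, r in zip(policy_logps, ref_logps)]

pair_loss = dpo_pairwise_loss(chain_rewards[0], chain_rewards[-1])
chain_loss = plackett_luce_loss(chain_rewards)
```

With a chain of length two, the Plackett-Luce loss reduces to the pairwise DPO loss, so the binary setting is a special case of this sketch. Longer chains add a term for each intermediate caption, which lets a mildly degraded caption be placed above a heavily degraded one instead of being treated only as a loser.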
