Fugu-MT Paper Translation (Abstract): Bottleneck Tokens for Unified Multimodal Retrieval

Paper abstract: Bottleneck Tokens for Unified Multimodal Retrieval

arxiv url: http://arxiv.org/abs/2604.11095v1
Date: Mon, 13 Apr 2026 07:12:12 GMT
Status: Translation complete
System update date: 2026-04-14 20:13:16.387897
Title: Bottleneck Tokens for Unified Multimodal Retrieval
Title (reference translation): Bottleneck Tokens for Unified Multimodal Retrieval
Authors: Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng
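Abstract summary: Adapting decoder-only multimodal large language models (MLLMs) for multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token as the sequence-level representation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed. The paper introduces Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism.
Reference score (independently computed attention metric): 16.707536543758344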
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).
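Abstract (reference translation): Adapting decoder-only multimodal large language models (MLLMs) for multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, overloading the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed. Two complementary components address both gaps. Architecturally, the paper introduces Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, it proposes Generative Information Condensation, a next-token prediction objective combined with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and the BToks are processed in a single forward pass, with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), the approach achieves state-of-the-art results among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).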
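
The Condensation Mask described in the abstract can be illustrated with a small attention-mask sketch. The abstract states only that the mask severs the direct attention path from target tokens to query tokens; everything else here is an assumption made for illustration: the sequence layout [query | BToks | target], ordinary causal attention elsewhere, the function name `condensation_mask`, and the token counts. It is not the authors' implementation.

```python
def condensation_mask(n_query, n_btok, n_target):
    """Return a boolean matrix where mask[i][j] is True if position i may attend to j.

    Assumed (hypothetical) layout: [query tokens | BToks | target tokens].
    """
    n = n_query + n_btok + n_target
    target_start = n_query + n_btok
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            # Standard causal attention: a position sees itself and earlier positions.
            allowed = j <= i
            # Condensation step: a target token may not look at query tokens,
            # so information about the query can only reach it via the BToks.
            if i >= target_start and j < n_query:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


# Illustrative sizes only: 3 query tokens, 2 BToks, 2 target tokens.
mask = condensation_mask(n_query=3, n_btok=2, n_target=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the first five rows (query tokens and BToks) form the usual causal triangle, while the last two rows (target tokens) have their first three columns blanked out, so they attend only to the BToks and to earlier target tokens. Under the same assumed layout, inference would simply omit the target segment, which matches the abstract's statement that only the input and the BToks are processed.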

Related papers