Fugu-MT Paper Translation (Abstract): Reason to Contrast: A Cascaded Multimodal Retrieval Framework

Paper summary: Reason to Contrast: A Cascaded Multimodal Retrieval Framework

arxiv url: http://arxiv.org/abs/2602.23369v1
Date: Sun, 21 Dec 2025 04:52:48 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:07.92837
Title: Reason to Contrast: A Cascaded Multimodal Retrieval Framework
Title (reference translation): Reason to Contrast: A Cascaded Multimodal Retrieval Framework
Authors: Xuanming Cui, Hong-You Chen, Hao Yu, Hao Yuan, Zihao Wang, Shlok Kumar Mishra, Hanchao Yu, Yonghuan Yang, Jun Xiao, Ser-Nam Lim, Jianpeng Cheng, Qi Guo, Xiangjun Fan
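Abstract summary: TTE-v2, a hybrid multimodal retrieval framework, introduces reasoning-driven performance scaling based on an additional input token budget rather than on model or embedding size. The method augments the initial multimodal retrieval with additional reasoning steps for reranking, enabling more expressive query-candidate interactions at test time. In experiments on the MMEB-V2 benchmark, TTE-v2-7B achieves a new state-of-the-art accuracy of 75.7%, and TTE-v2-2B matches or surpasses leading 7B models trained with significantly larger external data.
Reference score (independently computed attention score): 60.99421225506685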
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional multimodal retrieval systems rely primarily on bi-encoder architectures, where performance is closely tied to embedding dimensionality. Recent work, Think-Then-Embed (TTE), shows that incorporating multimodal reasoning to elicit additional informative tokens before embedding can further improve retrieval. In this paper, we extend this paradigm with TTE-v2, a hybrid multimodal retrieval framework that introduces reasoning-driven performance scaling based on additional input token budget rather than model or embedding size. Our approach augments the initial multimodal retrieval with additional reasoning steps for reranking, enabling more expressive query-candidate interactions at test time. The reranking stage further provides fine-grained supervision for hard negative mining and false negative filtering, creating a feedback loop that effectively strengthens the upstream retriever. This cascaded design delivers substantial test-time improvements based on intermediate reasoning token scaling. Experiments on the MMEB-V2 benchmark demonstrate that TTE-v2-7B achieves a new state-of-the-art accuracy of 75.7%, and that TTE-v2-2B matches or surpasses leading 7B models trained with significantly larger external data. Our results highlight the promise of token-wise scaling as an alternative scaling paradigm for multimodal retrieval.
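Abstract (reference translation): Conventional multimodal retrieval systems depend mainly on bi-encoder architectures, in which performance is closely tied to embedding dimensionality. The recent Think-Then-Embed (TTE) work shows that using multimodal reasoning to elicit additional informative tokens before embedding can further improve retrieval. This paper extends that paradigm with TTE-v2, a hybrid multimodal retrieval framework that introduces reasoning-driven performance scaling based on an additional input token budget rather than on model or embedding size. The method augments the initial multimodal retrieval with additional reasoning steps for reranking, enabling more expressive query-candidate interactions at test time. The reranking stage also provides fine-grained supervision for hard negative mining and false negative filtering, creating a feedback loop that strengthens the upstream retriever. This cascaded design yields substantial test-time improvements that scale with the number of intermediate reasoning tokens. In experiments on the MMEB-V2 benchmark, TTE-v2-7B achieves a new state-of-the-art accuracy of 75.7%, and TTE-v2-2B matches or surpasses leading 7B models trained with significantly larger external data. The results point to token-wise scaling as a promising alternative scaling paradigm for multimodal retrieval.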
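The abstract describes a cascade: a bi-encoder retrieves a shortlist, a reasoning-based reranker rescores it, and the reranker's scores are reused to pick hard negatives and discard suspected false negatives for retriever training. The following is only a minimal sketch of that control flow under stated assumptions, not the authors' implementation. The function names (`retrieve`, `rerank`, `mine_negatives`), the use of cosine similarity, the fixed score threshold, and the toy embeddings and reranker scores are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Vector, cand_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (assumed bi-encoder): rank candidates by embedding similarity, keep top k."""
    ranked = sorted(cand_vecs, key=lambda cid: cosine(query_vec, cand_vecs[cid]), reverse=True)
    return ranked[:k]


def rerank(shortlist: List[str], score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Stage 2: rescore the shortlist with a more expensive scorer.

    score_fn stands in for a reasoning-based reranker; here it is a hypothetical callable.
    """
    scored = [(cid, score_fn(cid)) for cid in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def mine_negatives(
    reranked: List[Tuple[str, float]],
    positive_id: str,
    false_neg_threshold: float,
    num_hard: int,
) -> Tuple[List[str], List[str]]:
    """Use reranker scores as supervision for retriever training data.

    Non-positive candidates scoring at or above the (assumed) threshold are treated
    as suspected false negatives and filtered out; the highest-scoring remaining
    candidates become hard negatives.
    """
    hard: List[str] = []
    filtered: List[str] = []
    for cid, score in reranked:
        if cid == positive_id:
            continue
        if score >= false_neg_threshold:
            filtered.append(cid)
        elif len(hard) < num_hard:
            hard.append(cid)
    return hard, filtered


# Toy data: 2-d embeddings and made-up reranker scores, for illustration only.
cand_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
toy_reranker_scores = {"a": 0.95, "b": 0.90, "c": 0.05, "d": 0.40}

shortlist = retrieve([1.0, 0.2], cand_vecs, k=3)
reranked = rerank(shortlist, toy_reranker_scores.__getitem__)
hard, filtered = mine_negatives(reranked, positive_id="a", false_neg_threshold=0.8, num_hard=1)

print(shortlist)  # shortlist from the embedding stage
print(reranked)   # shortlist reordered by the reranker
print(hard, filtered)
```

In this toy run the reranker moves the labeled positive to the top of the shortlist, and the near-duplicate candidate that the reranker also scores highly is filtered as a suspected false negative rather than used as a training negative. How the actual system sets such thresholds or spends its reasoning-token budget is not specified in the abstract.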

Related papers