Fugu-MT Paper Translation (Abstract): miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

Paper overview: miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

arxiv url: http://arxiv.org/abs/2606.10759v2
Date: Tue, 16 Jun 2026 02:36:19 GMT
Status: Translation complete
Last updated in system: 2026-06-17 15:01:46.521111
Title: miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity
Title (reference translation): miniReranker: Efficient Multimodal Reranking via Visual Cache Reuse and Interaction Sparsity
Authors: Yingqi Fan, Xuan Lu, Anhao Zhao, Junlong Tong, Ping Nie, Kai Zou, Yunpu Ma, Wei Zhang, Xiaoyu Shen
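Abstract summary: Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers. However, point-wise reranking suffers from substantial repeated computation across query-document pairs. This paper proposes a $\textit{vision-first}$ formulation that improves both cache reuse efficiency and reranking performance.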
Reference score (independently computed attention score): 21.54829080388454
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query--document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query--document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a $\textit{vision-first}$ formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) $\textit{model depth}$, for which we reduce active parameters via early exit; (2) $\textit{cross-segment attention}$, which we restrict to a narrow interaction band across a few layers; and (3) $\textit{visual tokens}$, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to <1% of the dense implementation under high-reuse settings for a single query, while preserving >96% of the dense model performance.
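Abstract (reference translation): Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query-document relevance. However, point-wise reranking suffers from substantial repeated computation across query-document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. Existing query-first and document-first formats are misaligned with both VQA-style prompting and computation-aware reuse, so the paper proposes a $\textit{vision-first}$ formulation that improves both cache reuse efficiency and reranking performance. The remaining cost is still considerable and comes from three main sources: (1) $\textit{model depth}$, addressed by early exit; (2) $\textit{cross-segment attention}$, restricted to a narrow interaction band across a few layers; and (3) $\textit{visual tokens}$, reduced via embedder-guided pruning. Combined, these designs form miniReranker, which reduces reranking runtime to <1% of the dense implementation under high-reuse settings for a single query while preserving >96% of the dense model performance.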
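The abstract's remark that the causal structure of transformers allows only prefix segments to be reused can be illustrated with a toy example. The sketch below is not the miniReranker implementation: it is a single attention layer with identity projections (keys and values equal the token vectors), and the names `visual_prefix`, `text_suffix_a`, `text_suffix_b`, and `causal_attention` are hypothetical. It assumes the visual tokens form the shared prefix and checks two things: the prefix outputs do not change when the suffix changes, and the suffix outputs computed against a cached prefix match a full recomputation.

```python
import math
import random


def causal_attention(queries, keys, values):
    """Toy single-head causal attention.

    `queries` covers the last len(queries) positions of the sequence described
    by `keys`/`values`; each position attends only to itself and earlier ones.
    """
    offset = len(keys) - len(queries)  # positions already covered by the cache
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        visible = offset + i + 1
        scores = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in range(visible)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(values[0]))
        ])
    return outputs


def random_tokens(n, dim=4):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


random.seed(0)
visual_prefix = random_tokens(6)  # assumption: visual tokens come first
text_suffix_a = random_tokens(3)  # hypothetical text segment A
text_suffix_b = random_tokens(2)  # hypothetical text segment B

# Full recomputation for two different suffixes sharing the same prefix.
seq_a = visual_prefix + text_suffix_a
seq_b = visual_prefix + text_suffix_b
full_a = causal_attention(seq_a, seq_a, seq_a)
full_b = causal_attention(seq_b, seq_b, seq_b)

# Cached path: with identity projections the prefix keys/values are just the
# prefix tokens, so they are computed once and only the suffix is processed.
cached_kv = visual_prefix
reuse_a = causal_attention(text_suffix_a, cached_kv + text_suffix_a, cached_kv + text_suffix_a)

n = len(visual_prefix)
prefix_gap = max_abs_diff(full_a[:n], full_b[:n])  # prefix outputs ignore the suffix
suffix_gap = max_abs_diff(full_a[n:], reuse_a)     # cached path matches full recompute
print(prefix_gap, suffix_gap)
```

Both gaps should be zero up to floating-point error. The same argument does not hold for a segment placed after the varying part, which is why the ordering of segments in the prompt determines what can be pre-cached.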

Related papers