Fugu-MT Paper Translation (Abstract): Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

Paper overview: Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

arxiv url: http://arxiv.org/abs/2606.04300v1
Date: Wed, 03 Jun 2026 00:08:44 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.434925
Title: Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval
Title (reference translation): Argus-Retriever: Vision-LLM Late-Interaction Retrieval Using a Region-Aware Query-Conditioned MoE for Visual Document Retrieval
Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohammed Ali, Adam Jatowt
Abstract summary: Argus is a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. The 9B model reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard.
Reference score (independently computed attention score): 21.24115784579366
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce \textbf{Argus}, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. Argus adds a region-aware Mixture-of-Experts module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts per region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation is now dependent on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9\% of the available public supervision rather than the full pool. The 9B model reaches \textbf{92.67} NDCG@5 on ViDoRe V1 and \textbf{86.0} NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B further improves its NDCG@10 from 60.28 to \textbf{64.80} over public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.
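Abstract (reference translation): Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, document embeddings are generated without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce \textbf{Argus}, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. The query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts for each region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation depends on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9\% of the available public supervision rather than the full pool. The 9B model reaches \textbf{92.67} NDCG@5 on ViDoRe V1 and \textbf{86.0} NDCG@5 on the combined V1+V2 leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B improves its NDCG@10 from 60.28 to \textbf{64.80}, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.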
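
The scoring idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `maxsim` follows the standard late-interaction definition (for each query token, take the best-matching document token and sum the results), while `condition_document` is a hypothetical stand-in for the region-aware, query-conditioned step. The names `region_of_token`, `router`, and `experts`, the top-1 routing rule, the per-dimension gain experts, and the assumption that the context vector has the same dimension as the token embeddings are all illustrative choices that the abstract does not specify.

```python
from typing import List, Sequence

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vec], doc_tokens: Sequence[Vec]) -> float:
    """Late-interaction score: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def condition_document(
    doc_tokens: Sequence[Vec],
    region_of_token: Sequence[int],  # hypothetical: region index of each token
    context: Vec,                    # compact query context vector (assumed same dim)
    router: Sequence[Vec],           # hypothetical: one routing vector per expert
    experts: Sequence[Vec],          # hypothetical: one per-dimension gain per expert
) -> List[List[float]]:
    """Return a query-dependent D(q): each token is rescaled by its region's expert."""
    n_regions = max(region_of_token) + 1
    dim = len(doc_tokens[0])

    # Mean-pool the page tokens into one summary vector per region.
    sums = [[0.0] * dim for _ in range(n_regions)]
    counts = [0] * n_regions
    for tok, r in zip(doc_tokens, region_of_token):
        counts[r] += 1
        for i, v in enumerate(tok):
            sums[r][i] += v
    pooled = [[v / max(c, 1) for v in s] for s, c in zip(sums, counts)]

    # Top-1 routing (an assumption): mix the region summary with the query
    # context elementwise, then pick the expert with the highest logit.
    chosen = []
    for p in pooled:
        feat = [a * b for a, b in zip(p, context)]
        logits = [dot(w, feat) for w in router]
        chosen.append(max(range(len(router)), key=lambda e: logits[e]))

    # Apply the chosen expert's gains to every token in the region.
    return [
        [v * g for v, g in zip(tok, experts[chosen[r]])]
        for tok, r in zip(doc_tokens, region_of_token)
    ]


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    regions = [0, 1, 1]
    router = [[1.0, -1.0], [-1.0, 1.0]]
    experts = [[2.0, 1.0], [1.0, 2.0]]

    print("query-independent score:", maxsim(query, doc))
    for ctx in ([1.0, 1.0], [1.0, 0.0]):
        d_q = condition_document(doc, regions, ctx, router, experts)
        print("context", ctx, "-> score", maxsim(query, d_q))
```

In this toy setup the same page receives a different score depending on the context vector, because the router assigns the second region to a different expert. That mirrors the abstract's point that the document representation becomes $\mathbf{D}(q)$ while scoring still goes through MaxSim.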


Related papers