Fugu-MT Paper Translation (Abstract): DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

Paper summary: DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

arxiv url: http://arxiv.org/abs/2605.30027v1
Date: Thu, 28 May 2026 14:50:53 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.402505
Title: DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
Title (reference translation): DocRetriever: A plug-and-play framework for multimodal document retrieval, with a comprehensive benchmark
Authors: Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao
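Abstract summary: Multimodal documents contain diverse elements such as tables, figures, and layouts. Current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval. This paper proposes DocRetriever, a plug-and-play framework that enhances visual retrieval via a layout-aware sparse embedding technique.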
Reference score (independently computed attention score): 48.84943754804533
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.
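Abstract (reference translation): Multimodal documents contain a variety of elements, such as tables, figures, and layouts, which can complicate retrieval tasks. Current approaches generally pair dense visual embedding models with supervised rerankers to achieve high-precision retrieval, but they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obscure explicit semantics, so structurally salient information goes unused. Second, supervised reranking models depend heavily on domain-specific training data and therefore suffer from generalization bottlenecks. In addition, existing benchmarks often lack diverse evaluation dimensions and comprehensive relevance annotations, which limits reliable evaluation. To address these challenges, the authors propose DocRetriever, a plug-and-play framework. It enhances visual retrieval with a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). The authors also introduce a generalizable reranker that uses reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, they construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.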
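
The abstract mentions "hybrid encoding" that pairs dense visual embeddings with a sparse embedding, but it does not say how the two signals are combined. As a rough illustration only, the sketch below shows one generic way to fuse dense and sparse relevance scores: min-max normalization followed by a weighted sum. This is not DocRetriever's method. The function names, the `alpha` weight, and the dictionary layout of each page are all hypothetical and chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sparse_dot(query, doc):
    """Dot product between two sparse vectors given as {term: weight} dicts."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def min_max(scores):
    """Rescale scores to [0, 1]; constant or empty inputs map to zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(query_dense, query_sparse, docs, alpha=0.5):
    """Rank docs by alpha * dense + (1 - alpha) * sparse (assumed fusion rule)."""
    dense = min_max([cosine(query_dense, d["dense"]) for d in docs])
    sparse = min_max([sparse_dot(query_sparse, d["sparse"]) for d in docs])
    fused = [alpha * x + (1 - alpha) * y for x, y in zip(dense, sparse)]
    order = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return [(docs[i]["id"], fused[i]) for i in order]


if __name__ == "__main__":
    # Toy pages with made-up embeddings, for illustration only.
    pages = [
        {"id": "page_a", "dense": [1.0, 0.0], "sparse": {"table": 1.0}},
        {"id": "page_b", "dense": [0.0, 1.0], "sparse": {"figure": 1.0}},
        {"id": "page_c", "dense": [0.7, 0.7], "sparse": {"table": 0.5, "figure": 0.5}},
    ]
    for page_id, score in hybrid_rank([1.0, 0.0], {"table": 1.0}, pages):
        print(page_id, round(score, 3))
```

Setting `alpha` to 1.0 reduces this to dense-only ranking and 0.0 to sparse-only ranking, which makes it easy to compare each signal against the fused result on the same candidates.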
