Fugu-MT Paper Translation (Abstract): Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval

Paper summary: Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval

arxiv url: http://arxiv.org/abs/2511.21121v1
Date: Wed, 26 Nov 2025 07:18:06 GMT
Status: Translation complete
System update date: 2025-11-27 18:37:59.006324
Title: Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval
Title (reference translation): Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval
Authors: Anup Roy, Rishabh Gyanendra Upadhyay, Animesh Rameshbhai Panara, Robin Mills
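Abstract summary: Document-centric RAG pipelines usually begin with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. We introduce VisionRAG, a multimodal retrieval system that is OCR-free and model-agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and spatial cues, and builds semantic vectors without committing to a specific extraction.
Reference score (independently computed attention score): 0.0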
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document-centric RAG pipelines usually begin with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. These text-first workflows are costly to maintain, sensitive to small layout shifts, and often lose the spatial cues that contain the answer. Vision-first retrieval has emerged as a strong alternative. By operating directly on page images, systems like ColPali and ColQwen preserve structure and reduce pipeline complexity while achieving strong benchmark performance. However, these late-interaction models tie retrieval to a specific vision backbone and require storing hundreds of patch embeddings per page, creating high memory overhead and complicating large-scale deployment. We introduce VisionRAG, a multimodal retrieval system that is OCR-free and model-agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and spatial cues, and builds semantic vectors without committing to a specific extraction. Our three-pass pyramid indexing framework creates vectors using global page summaries, section headers, visual hotspots, and fact-level cues. These summaries act as lightweight retrieval surrogates. At query time, VisionRAG retrieves the most relevant pages using the pyramid index, then forwards the raw page image encoded as base64 to a multimodal LLM for final question answering. During retrieval, reciprocal rank fusion integrates signals across the pyramid to produce robust ranking. VisionRAG stores only 17 to 27 vectors per page, matching the efficiency of patch-based methods while staying flexible across multimodal encoders. On financial document benchmarks, it achieves 0.8051 accuracy at 10 on FinanceBench and 0.9629 recall at 100 on TAT-DQA. These results show that OCR-free, summary-guided multimodal retrieval is a practical and scalable alternative to traditional text extraction pipelines.
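Abstract (reference translation): Document-centric RAG pipelines usually start with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. These text-first workflows are costly to maintain, sensitive to small layout shifts, and often lose the spatial cues that contain the answer. Vision-first retrieval has emerged as a powerful alternative. By operating directly on page images, systems such as ColPali and ColQwen preserve structure and reduce pipeline complexity while delivering strong benchmark performance. However, these late-interaction models tie retrieval to a specific vision backbone and store hundreds of patch embeddings per page, which raises memory overhead and complicates large-scale deployment. We introduce VisionRAG, a multimodal retrieval system that is OCR-free and model-agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and spatial cues, and builds semantic vectors without committing to a specific extraction. Our three-pass pyramid indexing framework generates vectors from global page summaries, section headers, visual hotspots, and fact-level cues. These summaries serve as lightweight retrieval surrogates. At query time, VisionRAG uses the pyramid index to retrieve the most relevant pages, then forwards the raw page image, encoded as base64, to a multimodal LLM for final question answering. During retrieval, reciprocal rank fusion integrates signals across the pyramid to produce a robust ranking. VisionRAG stores only 17 to 27 vectors per page, matching the efficiency of patch-based methods while remaining flexible across multimodal encoders. On financial document benchmarks, it achieves 0.8051 accuracy at 10 on FinanceBench and 0.9629 recall at 100 on TAT-DQA. These results suggest that OCR-free, summary-guided multimodal retrieval is a practical and scalable alternative to traditional text extraction pipelines.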
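
The abstract says reciprocal rank fusion combines signals across the pyramid levels but gives no formula or parameters. As a rough illustration only, the sketch below uses the commonly cited form of reciprocal rank fusion, where each page scores the sum of 1 / (k + rank) over the lists it appears in. The level names, page ids, and the constant k = 60 are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page ids into a single ranking.

    rankings: dict mapping a pyramid level name to a list of page ids,
              ordered best first.
    k: smoothing constant. 60 is a widely used default for RRF; the
       abstract does not state which value VisionRAG uses.
    """
    scores = defaultdict(float)
    for ranked_pages in rankings.values():
        for rank, page_id in enumerate(ranked_pages, start=1):
            # Each list contributes more for pages it ranks higher.
            scores[page_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by page id for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical per-level rankings for one query (all ids are made up).
rankings = {
    "page_summary": ["p3", "p1", "p2"],
    "section_headers": ["p1", "p3", "p4"],
    "fact_cues": ["p1", "p2", "p3"],
}
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Because the fusion only looks at rank positions, similarity scores from different levels (or different encoders) never need to be put on a common scale, which fits the model-agnostic framing in the abstract.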
