DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
- URL: http://arxiv.org/abs/2511.11552v1
- Date: Fri, 14 Nov 2025 18:42:18 GMT
- Title: DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
- Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
- Abstract summary: We propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. It achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts.
- Score: 59.4112754806335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
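The sampling-adjudication step can be pictured with a short sketch: sample several candidate answers over the localized evidence, accept a clear majority if one exists, and otherwise ask the model to adjudicate among the candidates. The `vlm.generate` interface and all parameter names below are illustrative assumptions, not the authors' actual API.

```python
# Hedged sketch of a sampling-adjudication loop; `vlm` is a hypothetical
# VLM client, not the DocLens implementation.
from collections import Counter

def answer_with_adjudication(vlm, question, evidence_crops, n_samples=5, temperature=0.7):
    """Sample several candidate answers over localized evidence, then pick one."""
    candidates = [
        vlm.generate(question=question, images=evidence_crops, temperature=temperature)
        for _ in range(n_samples)
    ]
    # If a clear majority answer exists, return it directly.
    most_common, count = Counter(candidates).most_common(1)[0]
    if count > n_samples // 2:
        return most_common
    # Otherwise ask the model to adjudicate among the divergent candidates.
    prompt = (
        "Question: " + question + "\n"
        "Candidate answers:\n" + "\n".join(f"- {c}" for c in candidates) + "\n"
        "Pick the answer best supported by the evidence, or say 'unanswerable'."
    )
    return vlm.generate(question=prompt, images=evidence_crops, temperature=0.0)
```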
Related papers
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z) - ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents [17.497004687630742]
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents.
arXiv Detail & Related papers (2025-10-29T16:32:26Z) - MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding [66.23502779435053]
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks.
Existing benchmarks either contain limited fine-grained evaluation samples mixed with other data, or are confined to object-level assessments in natural images.
We propose using document images with multi-granularity and multi-modal information to supplement natural images.
arXiv Detail & Related papers (2024-10-25T16:00:55Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
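As a rough illustration of how a variable number of high-resolution patch tokens can be compressed into a fixed budget of 324 tokens, one common mechanism is cross-attention with learned query tokens. The sketch below is an assumption for exposition, not necessarily how the DocCompressor module is actually built.

```python
# Illustrative only: fixed-budget token compression via learned-query cross-attention.
# The 324-token budget matches the figure quoted in the abstract; the mechanism is assumed.
import torch
import torch.nn as nn

class FixedBudgetCompressor(nn.Module):
    def __init__(self, dim=1024, num_queries=324, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        """patch_tokens: (batch, num_patches, dim) -> compressed (batch, 324, dim)."""
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return compressed
```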
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - ColPali: Efficient Document Retrieval with Vision Language Models [15.369861972085136]
We introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept: doing document retrieval by directly embedding the images of document pages. We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages.
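A minimal sketch of how retrieval over multi-vector page embeddings can be scored, using ColBERT-style late interaction (MaxSim): each query-token embedding is matched to its most similar page-patch embedding and the similarities are summed. The tensor shapes and helper names are assumptions for illustration, not ColPali's exact implementation.

```python
# Hedged sketch of late-interaction (MaxSim) scoring over multi-vector embeddings.
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    """query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)."""
    sim = query_emb @ page_emb.T                    # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum().item()      # MaxSim over patches, summed over query tokens

def rank_pages(query_emb, page_embs):
    """Rank candidate pages by their late-interaction score (highest first)."""
    scores = [maxsim_score(query_emb, p) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: scores[i], reverse=True)
```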
arXiv Detail & Related papers (2024-06-27T15:45:29Z) - Focus Anywhere for Fine-grained Multi-page Document Understanding [24.76897786595502]
This paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages.
We render cross-vocabulary vision data as the foreground to fully activate the multiple visual vocabularies and support in-document figure understanding.
arXiv Detail & Related papers (2024-05-23T08:15:49Z) - HRVDA: High-Resolution Visual Document Assistant [32.51417315241559]
We propose a High-Resolution Visual Document Assistant (HRVDA) to bridge the gap between MLLMs and visual document understanding.
HRVDA employs a content filtering mechanism and an instruction filtering module to filter out the content-agnostic visual tokens and instruction-agnostic visual tokens.
Our model achieves state-of-the-art performance across multiple document understanding datasets.
arXiv Detail & Related papers (2024-04-10T11:10:50Z) - Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
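A minimal sketch of the kind of contrastive alignment such a framework could use: a symmetric InfoNCE-style loss between pooled document-object features and vision-encoder features. The exact loss used by DoCo may differ; the function below is illustrative only.

```python
# Hedged sketch of a symmetric InfoNCE-style alignment loss; not DoCo's exact objective.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obj_feats, vis_feats, temperature=0.07):
    """obj_feats, vis_feats: (batch, dim) features for the same documents.
    Matched pairs lie on the diagonal of the similarity matrix."""
    obj = F.normalize(obj_feats, dim=-1)
    vis = F.normalize(vis_feats, dim=-1)
    logits = obj @ vis.T / temperature                 # (batch, batch)
    targets = torch.arange(obj.size(0), device=obj.device)
    # Symmetric cross-entropy: object-to-vision and vision-to-object directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```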
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing works that either struggle with high-resolution documents or give up the large language model and thereby constrain vision or language ability, our DocPedia directly processes visual input in the frequency domain rather than in the pixel space.
arXiv Detail & Related papers (2023-11-20T14:42:25Z)