ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
- URL: http://arxiv.org/abs/2510.25668v1
- Date: Wed, 29 Oct 2025 16:32:26 GMT
- Title: ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
- Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
- Abstract summary: Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents.
- Score: 17.497004687630742
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
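To make the difference between the two navigation actions named in the abstract concrete, here is a minimal sketch of a toy document environment. It is not ALDEN's implementation: the `DocumentEnv` class, the `Page` type, the word-overlap ranking inside `search`, and the use of plain text in place of page images are all assumptions made for illustration. The only point it shows is that `search` ranks pages against a query, while `fetch` jumps straight to a page by its index.

```python
from dataclasses import dataclass


@dataclass
class Page:
    index: int  # 1-based page number
    text: str   # stand-in for page content (hypothetical; the paper targets page images)


class DocumentEnv:
    """Toy long-document environment with two hypothetical actions."""

    def __init__(self, pages):
        self.pages = [Page(i + 1, t) for i, t in enumerate(pages)]

    def search(self, query, top_k=2):
        """Rank pages by naive word overlap with the query (placeholder retriever)."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(p.text.lower().split())), p) for p in self.pages]
        # Highest overlap first; ties broken by page order.
        ranked = sorted(scored, key=lambda s: (-s[0], s[1].index))
        return [p for score, p in ranked[:top_k] if score > 0]

    def fetch(self, index):
        """Return the page at a 1-based index, or None if it is out of range."""
        if 1 <= index <= len(self.pages):
            return self.pages[index - 1]
        return None


env = DocumentEnv([
    "Table of contents: revenue on page 3",
    "Introduction and company overview",
    "Revenue grew 12 percent in 2024",
])
hits = env.search("revenue 2024")  # query-driven lookup
page = env.fetch(3)                # direct access by page index
```

In this toy setting, an agent that reads "revenue on page 3" in the table of contents can call `fetch(3)` directly instead of issuing another query, which is the kind of structure-aware shortcut the abstract attributes to the fetch action.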
Related papers
- Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks.
Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens.
We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding.
We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process.
The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
- DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding [59.4112754806335]
We propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens.
It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer.
It achieves state-of-the-art performance on MMLongBench-Doc and FinRAG-V, surpassing even human experts.
arXiv Detail & Related papers (2025-11-14T18:42:18Z)
- SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models [0.0]
Understanding long-context visual information remains a fundamental challenge for vision-language models.
We propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism.
SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering.
arXiv Detail & Related papers (2025-10-22T17:47:12Z)
- Structured Attention Matters to Multimodal LLMs in Document Understanding [52.37530640460363]
We investigate how input format influences document comprehension performance.
We discover that raw OCR text often impairs rather than improves MLLMs' performance.
We propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm.
arXiv Detail & Related papers (2025-06-19T07:16:18Z)
- Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process.
Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- ColPali: Efficient Document Retrieval with Vision Language Models [15.369861972085136]
We introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings.
The inherent complexity and performance shortcomings of modern systems motivate a new concept: doing document retrieval by directly embedding the images of the document pages.
We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages.
arXiv Detail & Related papers (2024-06-27T15:45:29Z)
- Focus Anywhere for Fine-grained Multi-page Document Understanding [24.76897786595502]
This paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages.
We render cross-vocabulary vision data as the foreground to achieve a full reaction of multiple visual vocabularies and in-document figure understanding.
arXiv Detail & Related papers (2024-05-23T08:15:49Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding [88.88844606781987]
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks.
The way they model and exploit the interactions between vision and language on documents has hindered them from better generalization ability and higher accuracy.
In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals.
arXiv Detail & Related papers (2022-06-27T09:58:34Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.