Related papers: SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

URL: http://arxiv.org/abs/2510.26615v2
Date: Sat, 01 Nov 2025 21:48:18 GMT
Title: SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Authors: Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar
Abstract summary: We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-slide documents. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers.
Score: 28.839192349010048
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).

Related papers

APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design [28.89702589792701]
We propose APD-agents, a large language model driven multi-agent framework for app page design. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that APD-agents achieve state-of-the-art performance.
arXiv Detail & Related papers (2025-11-18T03:39:26Z)
DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding [59.4112754806335]
We propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. It achieves state-of-the-art performance on MMLongBench-Doc and FinRAG-V, surpassing even human experts.
arXiv Detail & Related papers (2025-11-14T18:42:18Z)
ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents [17.497004687630742]
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents.
arXiv Detail & Related papers (2025-10-29T16:32:26Z)
MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding [40.52017994491893]
MDocAgent is a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. Preliminary experiments on five benchmarks demonstrate the effectiveness of our MDocAgent, achieving an average improvement of 12.1%.
arXiv Detail & Related papers (2025-03-18T06:57:21Z)
AppAgent v2: Advanced Agent for Flexible Mobile Interactions [57.98933460388985]
This work introduces a novel LLM-based multimodal agent framework for mobile devices. Our agent constructs a flexible action space that enhances adaptability across various applications. Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD). We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models [9.232693392690702]
TextHawk is a document-oriented Multimodal Large Language Model (MLLM). It is designed to explore efficient fine-grained perception by designing four dedicated components. We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-14T09:48:37Z)
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core. Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z)
DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.