Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
- URL: http://arxiv.org/abs/2602.12957v1
- Date: Fri, 13 Feb 2026 14:22:10 GMT
- Title: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
- Authors: Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin,
- Abstract summary: We propose a training-free and highly efficient acceleration method for document parsing tasks.<n>Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens.<n>We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
- Score: 102.88996030431662
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
Related papers
- Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding [35.429403152845836]
Youtu-Parsing is an efficient and versatile document parsing model designed for high-performance content extraction.<n>The model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content.<n>Youtu-Parsing achieves state-of-the-art (SOTA) performance on the OmniDocBench and olmOCR-bench benchmarks.
arXiv Detail & Related papers (2026-01-28T09:37:13Z) - Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding.<n>We propose SLEUTH, a multi agent framework that orchestrates a retriever and four collaborative agents in a coarse to fine process.<n>The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z) - Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [46.14775667559124]
Document parsing from scanned images remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables.<n>Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data.<n>We introduce LayoutRL, a reinforcement learning framework that optimize layout understanding through composite rewards integrating normalized edit distance count accuracy, and reading order preservation.<n>We show that Infinity-Bench consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities.
arXiv Detail & Related papers (2025-10-17T06:26:59Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass.<n>Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - Logics-Parsing Technical Report [8.982345117231661]
We propose Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning.<n>Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference.<n>We introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories.
arXiv Detail & Related papers (2025-09-24T04:54:37Z) - Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [46.14775667559124]
layoutRL is an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware.<n>We will publicly release our code and dataset to accelerate progress in robust document understanding.
arXiv Detail & Related papers (2025-06-01T15:19:52Z) - PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information.<n>Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.<n>We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.