MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
- URL: http://arxiv.org/abs/2509.22186v2
- Date: Mon, 29 Sep 2025 16:41:28 GMT
- Title: MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
- Authors: Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He,
- Abstract summary: MinerU2.5 is a document parsing model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency.<n>Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition.
- Score: 117.58619053719251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
Related papers
- PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure.<n>We propose a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure.<n>Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
arXiv Detail & Related papers (2026-01-12T15:05:35Z) - Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval [15.126709823382539]
This work advances Contrastive Language-Image Pre-training (CLIP) for person representation learning.<n>We develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs.<n>We introduce the GA-DMS framework, which improves cross-modal alignment by adaptively masking noisy textual tokens.
arXiv Detail & Related papers (2025-09-11T03:06:22Z) - GRASP: Geospatial pixel Reasoning viA Structured Policy learning [16.023628299873494]
GRASP is a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner.<n> PRIME is a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives.<n>We release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks.
arXiv Detail & Related papers (2025-08-23T18:05:06Z) - From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge [0.0]
Key information from 2D engineering drawings is essential for advancing digital manufacturing.<n>Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text.<n>We propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language model (VLM)
arXiv Detail & Related papers (2025-06-20T17:10:01Z) - Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos.<n>Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation.<n>A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision.<n>Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis.<n>This paper introduces HAND, a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks.
arXiv Detail & Related papers (2024-12-25T20:36:29Z) - SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images [4.269350826756809]
This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera.
The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation task without compromising computational efficiency.
The framework is thoroughly evaluated on two datasets - the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset - and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks.
arXiv Detail & Related papers (2024-03-15T20:04:27Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.<n>Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - A Visual Representation-guided Framework with Global Affinity for Weakly
Supervised Salient Object Detection [8.823804648745487]
We propose a framework guided by general visual representations with rich contextual semantic knowledge for scribble-based SOD.
These general visual representations are generated by self-supervised learning based on large-scale unlabeled datasets.
Our method achieves comparable or even superior performance to the state-of-the-art fully supervised models.
arXiv Detail & Related papers (2023-02-21T14:31:57Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented
Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2 compared a transferable Cross-lingual and Cross-domain for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - A Holistically-Guided Decoder for Deep Representation Learning with
Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose one novel holistically-guided decoder which is introduced to obtain the high-resolution semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.