Related papers: Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

URL: http://arxiv.org/abs/2505.14059v1
Date: Tue, 20 May 2025 08:03:59 GMT
Title: Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Authors: Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang
Abstract summary: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. We present Dolphin, a novel multimodal document image parsing model following an analyze-then-parse paradigm. Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency.
Score: 20.588630224794976
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
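
The analyze-then-parse flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not Dolphin's actual code or API: the names `Anchor`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical, both model calls are replaced by stand-in functions, and a thread pool is used here only as a simple substitute for the model's parallel parsing mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prompt templates keyed by element type (not Dolphin's real prompts).
PROMPTS = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


@dataclass
class Anchor:
    kind: str    # element type, e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) region on the page


def analyze_layout(page):
    """Stage 1 stand-in: return layout anchors in reading order."""
    return [Anchor(kind, bbox) for kind, bbox in page["elements"]]


def parse_element(page, anchor):
    """Stage 2 stand-in: a real system would crop the region and query the model."""
    prompt = PROMPTS.get(anchor.kind, PROMPTS["text"])
    return f"[{anchor.kind}@{anchor.bbox}] {prompt}"


def parse_page(page, max_workers=4):
    anchors = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the reading order is preserved.
        return list(pool.map(lambda a: parse_element(page, a), anchors))


# Toy page description with three elements already in reading order.
page = {
    "elements": [
        ("text", (0, 0, 100, 20)),
        ("table", (0, 30, 100, 80)),
        ("formula", (0, 90, 100, 110)),
    ]
}
results = parse_page(page)
```

The point of the sketch is the split itself: because each anchor carries its own type and region, the second stage can process elements independently with a type-specific prompt, then reassemble the outputs in the order fixed by the first stage.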

Related papers

Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting [46.102790941920865]
We present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors.
arXiv Detail & Related papers (2026-02-05T07:09:57Z)
Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding [35.429403152845836]
Youtu-Parsing is an efficient and versatile document parsing model designed for high-performance content extraction. The model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Youtu-Parsing achieves state-of-the-art (SOTA) performance on the OmniDocBench and olmOCR-bench benchmarks.
arXiv Detail & Related papers (2026-01-28T09:37:13Z)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z)
Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [46.14775667559124]
Document parsing from scanned images remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. We introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, count accuracy, and reading order preservation. We show that Infinity-Bench consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities.
arXiv Detail & Related papers (2025-10-17T06:26:59Z)
GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing [23.64662356622401]
We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art scene text editing baseline.
arXiv Detail & Related papers (2025-05-08T03:11:58Z)
DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral [11.336757553731639]
Acquiring structured data from domain-specific, image-based documents is crucial for many downstream tasks. Many documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform.
arXiv Detail & Related papers (2025-05-06T06:02:42Z)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%. Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
Locate, Assign, Refine: Taming Customized Promptable Image Inpainting [22.163855501668206]
We introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting. We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of specific regions in images corresponding to the mask prompt. Our LAR-Gen adopts a coarse-to-fine manner to ensure the context consistency of source image, subject identity consistency, local semantic consistency to the text description, and smoothness consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z)
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks. In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach [104.2588068730834]
This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in literature. We develop a method that does not depend on a pre-trained backbone network. We evaluate the proposed approach (called PQDiff) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches.
arXiv Detail & Related papers (2024-01-28T13:00:38Z)
pose-format: Library for Viewing, Augmenting, and Handling .pose Files [4.606561440859961]
This paper presents pose-format, a comprehensive toolkit designed to address pose data challenges. The library includes a specialized file format that encapsulates various types of pose data, accommodating multiple individuals and an indefinite number of time frames. pose-format emerges as a one-stop solution, streamlining the complexities of pose data management and analysis.
arXiv Detail & Related papers (2023-10-13T12:41:28Z)
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts. Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. We contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs. We show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels. The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level. We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects. Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency. We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.