PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
- URL: http://arxiv.org/abs/2503.04065v2
- Date: Mon, 10 Mar 2025 03:22:24 GMT
- Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
- Authors: Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu,
- Abstract summary: PP-DocBee is a novel multimodal large language model designed for end-to-end document image understanding. We develop a data synthesis strategy tailored to document scenarios, building a diverse dataset to improve model generalization. We also apply several training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies.
- Score: 10.214889337096773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of digitalization, document images are being used ever more widely in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of their content. This report therefore presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios, in which we build a diverse dataset to improve model generalization. Then, we apply several training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, which achieves state-of-the-art results on English document understanding benchmarks and even outperforms existing open-source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at https://github.com/PaddlePaddle/PaddleMIX.
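The abstract names dynamic proportional sampling among the training techniques, but the listing carries no code. As a rough, hypothetical sketch of what sampling training batches from several document-domain pools in target proportions could look like (the pool names, ratios, and renormalization rule are illustrative assumptions, not PP-DocBee's published implementation):

```python
import random

# Illustrative sketch only: draw each batch from several document-domain
# pools according to target proportions, renormalizing over the remaining
# pools as smaller ones are exhausted.
def dynamic_proportional_batches(pools, target_props, batch_size, seed=0):
    rng = random.Random(seed)
    pools = {name: list(items) for name, items in pools.items()}
    for items in pools.values():
        rng.shuffle(items)
    while any(pools.values()):
        batch = []
        while len(batch) < batch_size:
            active = [name for name, items in pools.items() if items]
            if not active:
                break  # every pool is exhausted
            weights = [target_props[name] for name in active]
            choice = rng.choices(active, weights=weights, k=1)[0]
            batch.append(pools[choice].pop())
        yield batch

# Hypothetical mixture: oversample tables and charts relative to plain text.
pools = {"text": list(range(1000)), "table": list(range(300)), "chart": list(range(150))}
for batch in dynamic_proportional_batches(pools, {"text": 0.5, "table": 0.3, "chart": 0.2}, 32):
    pass  # feed `batch` to the training step
```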
Related papers
- Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
- PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding [2.778335169230448]
PP-DocBee2 is an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, it addresses the limitations of its predecessor through key technological improvements. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% compared to the vanilla version.
arXiv Detail & Related papers (2025-06-22T13:06:13Z)
- DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral [11.336757553731639]
Acquiring structured data from domain-specific, image-based documents is crucial for many downstream tasks. Many documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform.
arXiv Detail & Related papers (2025-05-06T06:02:42Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) over long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens (a toy sketch of this fixed-budget token compression appears after this list).
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing works, which either struggle with high-resolution documents or give up the large language model and are thus constrained in vision or language ability, our DocPedia directly processes visual input in the frequency domain rather than in pixel space (a toy example of this domain change appears after this list).
arXiv Detail & Related papers (2023-11-20T14:42:25Z)
- DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents [18.080447065002392]
We propose DocumentCLIP to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.
Our model is beneficial for real-world multimodal document understanding scenarios such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content.
arXiv Detail & Related papers (2023-06-09T23:51:11Z)
- PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream [33.68263291948121]
We propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS).
We introduce PDSum, a novel unsupervised algorithm built on the idea of prototype-driven continuous summarization.
PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents (see the sketch after this list).
arXiv Detail & Related papers (2023-02-10T23:43:46Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
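For the mPLUG-DocOwl2 entry above: its DocCompressor is a learned module, which this listing does not include. The sketch below only illustrates the surface idea of squeezing a variable-length visual-token sequence into a fixed budget of 324 tokens, using plain average pooling as an assumed stand-in:

```python
import numpy as np

# Toy stand-in for DocOwl2-style compression: reduce N visual tokens to a
# fixed budget by average-pooling contiguous groups. The paper uses a learned
# compressor; plain pooling is used here only for illustration.
def compress_visual_tokens(tokens: np.ndarray, budget: int = 324) -> np.ndarray:
    n, dim = tokens.shape
    if n <= budget:
        return tokens  # already within budget
    # Split token indices into `budget` nearly equal contiguous groups.
    bounds = np.linspace(0, n, budget + 1, dtype=int)
    return np.stack([tokens[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:])])

high_res_tokens = np.random.randn(2916, 1024)  # e.g. a 54x54 patch grid
compressed = compress_visual_tokens(high_res_tokens)
assert compressed.shape == (324, 1024)
```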
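For the DocPedia entry: the key move is feeding the model a frequency-domain view of the page instead of raw pixels. A minimal sketch of that domain change, assuming a 2D discrete cosine transform via SciPy (DocPedia's actual front end is more elaborate):

```python
import numpy as np
from scipy.fft import dctn

# Toy illustration of moving a document image into the frequency domain via a
# 2D DCT, as opposed to feeding raw pixels. Only the domain change is shown.
image = np.random.rand(1024, 1024)  # stand-in for a high-resolution page
freq = dctn(image, norm="ortho")    # frequency-domain representation
low_freq = freq[:128, :128]         # low frequencies give a compact view of the page
```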
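For the PDSum entry: a rough sketch of a prototype-driven summarizer, assuming the "lightweight prototype" is a running mean of document embeddings and sentences are ranked by cosine similarity to it; both assumptions are ours, for illustration only:

```python
import numpy as np

# Rough sketch: each evolving multi-document set keeps a prototype (a running
# mean of document embeddings) that is updated as documents stream in;
# sentences closest to the prototype are picked as the summary.
class SetPrototype:
    def __init__(self, dim: int):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, doc_embedding: np.ndarray) -> None:
        self.count += 1
        self.mean += (doc_embedding - self.mean) / self.count  # incremental mean

    def score(self, sentence_embeddings: np.ndarray) -> np.ndarray:
        # Cosine similarity between each candidate sentence and the prototype.
        p = self.mean / (np.linalg.norm(self.mean) + 1e-9)
        s = sentence_embeddings / (np.linalg.norm(sentence_embeddings, axis=1, keepdims=True) + 1e-9)
        return s @ p

proto = SetPrototype(dim=768)
for doc in [np.random.randn(768) for _ in range(5)]:   # streaming documents
    proto.update(doc)
sentences = np.random.randn(40, 768)                   # candidate sentence embeddings
summary_idx = np.argsort(-proto.score(sentences))[:3]  # top-3 sentences as summary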
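Finally, for the graph-based MDS entry immediately above: a minimal sketch of letting document representations exchange information along a similarity graph before summarization; the cosine graph and single message-passing step are illustrative assumptions, not the paper's encoder:

```python
import numpy as np

# Minimal sketch of graph-informed encoding for multi-document summarization:
# document embeddings exchange information along a similarity graph, so each
# representation absorbs cross-document relations before summary generation.
def graph_propagate(doc_embs: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    normed = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    adj = normed @ normed.T                    # cosine-similarity graph
    adj = np.where(adj > threshold, adj, 0.0)  # sparsify weak edges (self-loops remain)
    adj /= adj.sum(axis=1, keepdims=True)      # row-normalize edge weights
    return adj @ doc_embs                      # one message-passing step

docs = np.random.randn(6, 256)  # six document embeddings
fused = graph_propagate(docs)   # cross-document-aware representations
```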