Related papers: PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

URL: http://arxiv.org/abs/2506.18023v2
Date: Wed, 25 Jun 2025 02:40:39 GMT
Title: PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
Authors: Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu,
Abstract summary: PP-DocBee2 is an advanced version of the PP-DocBee, designed to enhance multimodal document understanding.<n>Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements.<n>These enhancements yield an $11.4%$ performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by $73.0%$ to the vanilla version.
Score: 2.778335169230448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This report introduces PP-DocBee2, an advanced version of the PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an $11.4\%$ performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by $73.0\%$ to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.

Related papers

Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization.<n>We introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding EOS> embeddings.<n>Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
QZhou-Embedding Technical Report [16.213081669689185]
Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies.<n>Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance.
arXiv Detail & Related papers (2025-08-29T13:47:22Z)
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks [10.214889337096773]
PP-DocBee is a novel multimodal large language model designed for end-to-end document image understanding.<n>We develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization.<n>We apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies.
arXiv Detail & Related papers (2025-03-06T03:43:21Z)
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.<n>InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.<n>We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [63.22096609916707]
Multi-modal Retrieval Augmented Multi-modal Generation (M$2$RAG) is a novel task that enables foundation models to process multi-modal web content.<n>Despite its potential impact, M$2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.<n>Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.<n>We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.<n>Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.<n>We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.<n>Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development.<n>This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique. Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization [11.02198476454955]
We propose fine-tuning an MDS baseline with a reward that balances a reference-based metric with coverage of the input documents. Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline.
arXiv Detail & Related papers (2022-03-06T07:33:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.