DocMamba: Efficient Document Pre-training with State Space Model
- URL: http://arxiv.org/abs/2409.11887v2
- Date: Mon, 10 Feb 2025 07:31:35 GMT
- Title: DocMamba: Efficient Document Pre-training with State Space Model
- Authors: Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang,
- Abstract summary: We present DocMamba, a novel framework based on the state space model.
It is designed to reduce computational complexity to linear while preserving global modeling capabilities.
Experiments on HRDoc confirm DocMamba's potential for length extrapolation.
- Score: 56.84200017560988
- Abstract: In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation.
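The Segment-First Bidirectional Scan is the part of the abstract that lends itself to a concrete illustration. The sketch below is only a rough interpretation of that idea: tokens are reordered so that tokens from the same layout segment become contiguous, then passed through forward and backward linear scans. The simple decaying recurrence, the segment ids, and all function names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the Segment-First Bidirectional Scan (SFBS) idea described in
# the abstract. The decaying linear recurrence is only a stand-in for the actual
# Mamba state space layer; segment ids and the decay value are illustrative.
import numpy as np

def segment_first_order(segment_ids):
    """Return token indices reordered so tokens of the same segment are contiguous."""
    segment_ids = np.asarray(segment_ids)
    # a stable sort keeps the original reading order inside each segment
    return np.argsort(segment_ids, kind="stable")

def linear_scan(x, decay=0.9):
    """O(L) recurrence h_t = decay * h_{t-1} + x_t, a placeholder for an SSM scan."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

def sfbs_block(x, segment_ids):
    """Reorder by segment, run forward and backward scans, and restore token order."""
    order = segment_first_order(segment_ids)
    x_seg = x[order]
    y = linear_scan(x_seg) + linear_scan(x_seg[::-1])[::-1]  # bidirectional scan
    out = np.empty_like(y)
    out[order] = y  # scatter results back to the original token positions
    return out

# toy example: 6 tokens with hidden size 4, belonging to three layout segments
tokens = np.random.randn(6, 4).astype(np.float32)
segments = [1, 0, 1, 2, 0, 2]
print(sfbs_block(tokens, segments).shape)  # (6, 4)
```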
Related papers
- VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction [9.516897428263146]
Document-level Relation Extraction (DocRE) aims to identify relationships between entity pairs within a document.
Most existing methods assume a uniform label distribution, resulting in suboptimal performance on real-world, imbalanced datasets.
We propose a novel data augmentation approach that uses generative models to augment data in the embedding space.
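As a rough illustration of augmenting an imbalanced relation-extraction dataset in embedding space, the sketch below fits a per-class Gaussian to entity-pair embeddings and samples synthetic points for under-represented classes. This is a deliberately simplified stand-in for the VAE/diffusion generator the abstract refers to; all names and shapes are assumptions.

```python
# A simplified sketch of embedding-space augmentation for imbalanced labels: a
# per-class Gaussian replaces the generative model described in the abstract.
import numpy as np

def augment_minority_classes(embeddings, labels, target_count, rng=None):
    """Sample synthetic embeddings for classes with fewer than target_count examples."""
    rng = np.random.default_rng() if rng is None else rng
    synth_x, synth_y = [], []
    for cls in np.unique(labels):
        cls_emb = embeddings[labels == cls]
        deficit = target_count - len(cls_emb)
        if deficit <= 0:
            continue
        mean = cls_emb.mean(axis=0)
        std = cls_emb.std(axis=0) + 1e-6  # avoid zero variance
        synth_x.append(rng.normal(mean, std, size=(deficit, embeddings.shape[1])))
        synth_y.extend([cls] * deficit)
    if not synth_x:
        return embeddings, labels
    return np.concatenate([embeddings, *synth_x]), np.concatenate([labels, synth_y])

# toy usage: 2 relation classes, one heavily under-represented
x = np.random.randn(12, 8)
y = np.array([0] * 10 + [1] * 2)
x_aug, y_aug = augment_minority_classes(x, y, target_count=10)
print(x_aug.shape, np.bincount(y_aug))  # (20, 8) [10 10]
```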
arXiv Detail & Related papers (2024-12-18T04:55:29Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves state-of-the-art performance compared with other approaches.
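A hedged sketch of the general idea of retaining a per-layer prefix of the KV cache under a global budget is given below. The importance scores and the shared-threshold binary search are illustrative assumptions, not PrefixKV's actual search procedure.

```python
# A rough sketch of budgeting per-layer KV cache prefixes, loosely following the
# abstract's "optimal global prefix configuration" framing. Not the paper's algorithm.
import numpy as np

def prefix_lengths(importance_per_layer, budget):
    """Pick how many KV entries each layer keeps so the total stays within budget.

    importance_per_layer: list of 1-D arrays of per-position importance, assumed
    sorted in descending order so that a prefix of each array is kept.
    """
    lo, hi = 0.0, max(float(scores[0]) for scores in importance_per_layer)
    for _ in range(50):  # binary search on a shared importance threshold
        thr = (lo + hi) / 2
        keep = [int((scores >= thr).sum()) for scores in importance_per_layer]
        if sum(keep) > budget:
            lo = thr  # keeping too much: raise the threshold
        else:
            hi = thr
    return [int((scores >= hi).sum()) for scores in importance_per_layer]

# toy usage: 3 layers with different importance profiles, total budget of 6 entries
layers = [np.sort(np.random.rand(8))[::-1] for _ in range(3)]
print(prefix_lengths(layers, budget=6))  # per-layer prefix lengths summing to <= 6
```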
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages.
For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm.
In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z) - Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
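The late-interaction setup can be illustrated with a small sketch: query and document tokens are encoded independently (the dual-encoder part) and combined by a lightweight scorer. The MaxSim-style reduction used here is one common choice (as in ColBERT); the paper above studies learnable scorers, which this sketch does not reproduce.

```python
# A brief sketch of late interaction over independently encoded token embeddings.
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """Sum over query tokens of the max cosine similarity against document tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# toy usage with random 64-dimensional token embeddings
query = np.random.randn(5, 64)
doc = np.random.randn(120, 64)
print(late_interaction_score(query, doc))
```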
arXiv Detail & Related papers (2024-06-25T22:50:48Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
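To make the reading order prediction objective a little more concrete, the sketch below builds next-token-in-reading-order targets from word bounding boxes using a naive top-to-bottom, left-to-right serialization. This is only a stand-in for the layout-knowledge-enhanced serialization the abstract describes; box format and function names are assumptions.

```python
# A small sketch of constructing reading order prediction targets from token boxes.
import numpy as np

def reading_order_targets(boxes):
    """boxes: (N, 4) array of [x0, y0, x1, y1]; returns next-token index per token (-1 = last)."""
    boxes = np.asarray(boxes, dtype=float)
    # naive serialization: sort by the box's top edge, then by its left edge
    order = np.lexsort((boxes[:, 0], boxes[:, 1]))
    targets = np.full(len(boxes), -1, dtype=int)
    for pos in range(len(order) - 1):
        targets[order[pos]] = order[pos + 1]
    return targets

# toy usage: four word boxes laid out in two rows
boxes = [[0, 0, 10, 5], [12, 0, 20, 5], [0, 8, 10, 13], [12, 8, 20, 13]]
print(reading_order_targets(boxes))  # [1 2 3 -1] for this layout
```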
arXiv Detail & Related papers (2022-10-12T12:59:24Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - VAULT: VAriable Unified Long Text Representation for Machine Reading Comprehension [31.639069657951747]
Existing models for Machine Reading Comprehension require complex architectures to model long texts with paragraph representation and classification.
We propose VAULT: a light-weight and parallel-efficient paragraph representation for MRC based on contextualized representation from long document input.
arXiv Detail & Related papers (2021-05-07T13:03:43Z) - Efficient Attentions for Long Document Summarization [25.234852272297598]
Hepos is a novel efficient encoder-decoder attention with head-wise positional strides.
We are able to process ten times more tokens than existing models that use full attention.
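A simplified sketch of head-wise positional strides in cross-attention follows: with stride s, each head attends only to every s-th source position (offset by the head index), cutting the encoder-decoder attention cost by roughly a factor of s. Shapes and the plain softmax scorer are illustrative; this is not the authors' implementation.

```python
# A simplified sketch of head-wise positional strides in encoder-decoder attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hepos_cross_attention(q, k, v, num_heads, stride):
    """q: (T, D) decoder states; k, v: (S, D) encoder states; returns (T, D)."""
    T, D = q.shape
    head_dim = D // num_heads
    out = np.zeros((T, D))
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        keep = np.arange(h % stride, k.shape[0], stride)  # strided source positions
        scores = q[:, sl] @ k[keep][:, sl].T / np.sqrt(head_dim)
        out[:, sl] = softmax(scores) @ v[keep][:, sl]
    return out

# toy usage: 4 decoder steps attending over 1000 source tokens with 8 heads, stride 4
dec, enc = np.random.randn(4, 64), np.random.randn(1000, 64)
print(hepos_cross_attention(dec, enc, num_heads=8, stride=4).shape)  # (4, 64)
```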
arXiv Detail & Related papers (2021-04-05T18:45:13Z) - ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [24.426571160930635]
We propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers.
Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-DOC to capture a much longer effective context length.
Various experiments on both English and Chinese document-level tasks are conducted.
arXiv Detail & Related papers (2020-12-31T16:12:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.