Attention Where It Matters: Rethinking Visual Document Understanding
with Selective Region Concentration
- URL: http://arxiv.org/abs/2309.01131v1
- Date: Sun, 3 Sep 2023 10:14:34 GMT
- Title: Attention Where It Matters: Rethinking Visual Document Understanding
with Selective Region Concentration
- Authors: Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu,
Yinsong Liu, Deqiang Jiang, Xing Sun
- Abstract summary: We propose a novel end-to-end document understanding model called SeRum.
SeRum converts image understanding and recognition tasks into a local decoding process of the visual tokens of interest.
We show that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks.
- Score: 26.408343160223517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel end-to-end document understanding model called SeRum
(SElective Region Understanding Model) for extracting meaningful information
from document images, with applications in document analysis, retrieval, and
office automation.
Unlike state-of-the-art approaches that rely on multi-stage technical schemes
and are computationally expensive,
SeRum converts document image understanding and recognition tasks into a
local decoding process of the visual tokens of interest, using a content-aware
token merge module.
This mechanism enables the model to pay more attention to regions of interest
generated by the query decoder, improving the model's effectiveness and
accelerating decoding in the generative scheme.
We also design several pre-training tasks to enhance the model's understanding
and local awareness.
Experimental results demonstrate that SeRum achieves state-of-the-art
performance on document understanding tasks and competitive results on text
spotting tasks.
SeRum represents a substantial advancement towards enabling efficient and
effective end-to-end document understanding.
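The abstract does not spell out the token-merge mechanism, but a minimal sketch of the idea (the keep ratio, shapes, and merging rule below are assumptions, not the authors' implementation) is to retain only the visual tokens that receive the most attention from the query decoder and compress the rest into a single summary token before generative decoding:

```python
import torch

def content_aware_token_merge(visual_tokens: torch.Tensor,
                              attn_weights: torch.Tensor,
                              keep_ratio: float = 0.25) -> torch.Tensor:
    """visual_tokens: (B, N, D); attn_weights: (B, N) attention mass per visual token."""
    b, n, d = visual_tokens.shape
    k = max(1, int(n * keep_ratio))
    # Keep the k most-attended visual tokens: the decoder's regions of interest.
    top_idx = attn_weights.topk(k, dim=1).indices                              # (B, k)
    kept = torch.gather(visual_tokens, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))
    # Merge all remaining tokens into one weighted summary token, so background
    # content is compressed rather than discarded outright.
    rest = torch.ones_like(attn_weights, dtype=torch.bool)
    rest.scatter_(1, top_idx, torch.zeros_like(top_idx, dtype=torch.bool))
    w = (attn_weights * rest).unsqueeze(-1)                                    # (B, N, 1)
    summary = (visual_tokens * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)
    return torch.cat([kept, summary], dim=1)                                   # (B, k + 1, D)

# Example: 196 patch tokens are reduced to 49 kept tokens plus 1 merged token.
tokens = torch.randn(2, 196, 256)
weights = torch.rand(2, 196).softmax(dim=1)
print(content_aware_token_merge(tokens, weights).shape)                        # torch.Size([2, 50, 256])
```

In this sketch, the generative decoder then runs over the reduced token set, which is where the claimed speed-up would come from.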
Related papers
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
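As a rough illustration of hypergraph attention (a generic sketch, not HGA's architecture; the incidence matrix and the two scoring functions are assumptions), tokens can be treated as nodes and candidate entity spans as hyperedges, with attention flowing from nodes to hyperedges and back:

```python
import torch

class HypergraphAttention(torch.nn.Module):
    """One hypergraph-attention round: nodes -> hyperedges -> nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.node_score = torch.nn.Linear(dim, 1)   # weight of a node inside its hyperedges
        self.edge_score = torch.nn.Linear(dim, 1)   # weight of a hyperedge for its nodes

    def forward(self, x: torch.Tensor, incidence: torch.Tensor) -> torch.Tensor:
        """x: (N, D) token features; incidence: (N, E), 1 where token i belongs to hyperedge e."""
        neg = torch.finfo(x.dtype).min
        # Pool each hyperedge from its member nodes with attention weights.
        a = self.node_score(x).expand(-1, incidence.shape[1])
        a = a.masked_fill(incidence == 0, neg).softmax(dim=0)            # (N, E)
        edges = a.T @ x                                                  # (E, D)
        # Update each node from the hyperedges it belongs to.
        b = self.edge_score(edges).T.expand(x.shape[0], -1)
        b = b.masked_fill(incidence == 0, neg).softmax(dim=1)            # (N, E)
        return x + b @ edges                                             # residual node update

# Example: 8 tokens grouped into 3 candidate entity hyperedges.
x = torch.randn(8, 32)
incidence = torch.zeros(8, 3)
incidence[0:3, 0] = incidence[3:6, 1] = incidence[5:8, 2] = 1.0
print(HypergraphAttention(32)(x, incidence).shape)                       # torch.Size([8, 32])
```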
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation [14.511401955827875]
Object detection in documents is a key step in automating the identification of structural elements.
We present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image.
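A generic form of graph-based distillation over detected document objects (a hedged sketch, not GraphKD's loss) matches both the node embeddings and the pairwise similarity structure of teacher and student:

```python
import torch
import torch.nn.functional as F

def graph_distillation_loss(student_nodes: torch.Tensor,
                            teacher_nodes: torch.Tensor,
                            tau: float = 2.0) -> torch.Tensor:
    """student_nodes, teacher_nodes: (N, D), one embedding per detected document object."""
    # Match node features directly.
    node_loss = F.mse_loss(student_nodes, teacher_nodes)
    # Match the relational structure: pairwise similarities act as soft edges.
    s_edges = F.normalize(student_nodes, dim=-1) @ F.normalize(student_nodes, dim=-1).T
    t_edges = F.normalize(teacher_nodes, dim=-1) @ F.normalize(teacher_nodes, dim=-1).T
    edge_loss = F.kl_div((s_edges / tau).log_softmax(dim=-1),
                         (t_edges / tau).softmax(dim=-1),
                         reduction="batchmean") * tau * tau
    return node_loss + edge_loss

print(graph_distillation_loss(torch.randn(10, 64), torch.randn(10, 64)))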
arXiv Detail & Related papers (2024-02-17T23:08:32Z)
- Leveraging Contextual Information for Effective Entity Salience Detection [21.30389576465761]
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches.
We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
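The cross-encoder setup referred to above can be sketched as follows; the backbone and the absence of fine-tuning are placeholders, and in practice the classification head would first be trained on entity-salience labels:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def salience_score(entity: str, document: str) -> float:
    # The entity mention and the document are packed into a single sequence so
    # that self-attention can relate the two (the cross-encoder pattern).
    enc = tok(entity, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.softmax(dim=-1)[0, 1].item()   # probability of the "salient" class

print(salience_score("SeRum", "We propose a novel end-to-end document understanding model called SeRum."))
```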
arXiv Detail & Related papers (2023-09-14T19:04:40Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
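A hedged sketch of such a lightweight cross-modal projection (dimensions and the dense-map construction are assumed, not taken from RefSAM): the embedding of the referring expression is mapped to SAM-style sparse and dense prompt embeddings.

```python
import torch

class CrossModalMLP(torch.nn.Module):
    """Project a sentence embedding of the referring expression into SAM-style
    sparse and dense prompt embeddings (all shapes here are assumptions)."""
    def __init__(self, text_dim: int = 512, prompt_dim: int = 256, dense_hw: tuple = (64, 64)):
        super().__init__()
        self.dense_hw = dense_hw
        self.sparse_proj = torch.nn.Sequential(
            torch.nn.Linear(text_dim, prompt_dim),
            torch.nn.GELU(),
            torch.nn.Linear(prompt_dim, prompt_dim),
        )
        self.dense_proj = torch.nn.Linear(text_dim, prompt_dim)

    def forward(self, text_emb: torch.Tensor):
        """text_emb: (B, text_dim) -> sparse (B, 1, prompt_dim), dense (B, prompt_dim, H, W)."""
        sparse = self.sparse_proj(text_emb).unsqueeze(1)
        h, w = self.dense_hw
        dense = self.dense_proj(text_emb)[:, :, None, None].expand(-1, -1, h, w)
        return sparse, dense

sparse, dense = CrossModalMLP()(torch.randn(2, 512))
print(sparse.shape, dense.shape)   # torch.Size([2, 1, 256]) torch.Size([2, 256, 64, 64])
```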
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Information Extraction from Documents: Question Answering vs Token Classification in real-world setups [0.0]
We compare the Question Answering approach with the classical token classification approach for document key information extraction.
Our research showed that when dealing with clean and relatively short entities, it is still best to use a token classification-based approach.
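For reference, the token-classification approach recommended here looks roughly like this; the backbone and the BIO label set are placeholders, and a real system would be fine-tuned on labelled documents first:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DATE", "I-DATE", "B-TOTAL", "I-TOTAL"]   # hypothetical BIO label set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))
model.eval()

text = "Invoice date: 2023-09-03. Total due: 42.00 USD."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0]   # one label id per subword token

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
print(list(zip(tokens, [labels[i] for i in pred.tolist()])))
```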
arXiv Detail & Related papers (2023-04-21T14:43:42Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from the knowledge corpus, respectively.
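A schematic of the dual retrieval step (toy embeddings and plain cosine similarity stand in for MoRe's actual retrievers and knowledge corpus):

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, corpus_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Indices of the k nearest corpus entries by cosine similarity."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(corpus_emb, dim=-1).T
    return sims.topk(k, dim=-1).indices

# Toy embeddings standing in for encoder outputs and a pre-built knowledge corpus.
text_query, image_query = torch.randn(1, 768), torch.randn(1, 512)
text_corpus, image_corpus = torch.randn(1000, 768), torch.randn(1000, 512)

text_hits = retrieve(text_query, text_corpus)     # knowledge retrieved for the input text
image_hits = retrieve(image_query, image_corpus)  # knowledge retrieved for the input image
print(text_hits, image_hits)                      # both sets would be appended to the NER/RE input
```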
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, and learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
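One plausible way to cast reading order prediction as a pre-training objective (an assumed pairwise formulation, not necessarily the one used by ERNIE-Layout) is to predict, for every pair of text segments, whether one precedes the other in the correct reading order:

```python
import torch
import torch.nn.functional as F

def reading_order_loss(seg_emb: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """seg_emb: (N, D) segment embeddings; order: (N,) ground-truth reading positions."""
    scores = seg_emb @ seg_emb.T                          # pairwise "i precedes j" logits
    target = (order[:, None] < order[None, :]).float()    # 1 if segment i is read before segment j
    off_diag = ~torch.eye(len(order), dtype=torch.bool)   # ignore self-pairs
    return F.binary_cross_entropy_with_logits(scores[off_diag], target[off_diag])

# Six text segments whose correct reading order is a random permutation here.
print(reading_order_loss(torch.randn(6, 64), torch.randperm(6)))
```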
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Focused Attention Improves Document-Grounded Generation [111.42360617630669]
Document-grounded generation is the task of using the information provided in a document to improve text generation.
This work focuses on two document-grounded generation tasks: Wikipedia update generation and dialogue response generation.
arXiv Detail & Related papers (2021-04-26T16:56:29Z)
- PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks [5.210482046387142]
Key Information Extraction from documents remains a challenge.
We introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE.
Our method outperforms baseline methods by significant margins.
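A compact sketch in the spirit of this summary (not PICK's actual layers): the adjacency between text segments is learned from their features and then used for one round of graph convolution.

```python
import torch
import torch.nn.functional as F

class GraphLearningConv(torch.nn.Module):
    """Learn a soft adjacency from node features, then run one graph-convolution step."""
    def __init__(self, dim: int):
        super().__init__()
        self.adj_proj = torch.nn.Linear(dim, dim)
        self.out_proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (N, D), one feature vector per text segment of the document."""
        # Graph learning: soft adjacency from pairwise feature affinities.
        adj = F.softmax(self.adj_proj(x) @ x.T / x.shape[-1] ** 0.5, dim=-1)   # (N, N)
        # Graph convolution: aggregate neighbours, transform, add a residual.
        return F.relu(self.out_proj(adj @ x)) + x

nodes = torch.randn(12, 128)                    # e.g. 12 OCR text segments
print(GraphLearningConv(128)(nodes).shape)      # torch.Size([12, 128])
```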
arXiv Detail & Related papers (2020-04-16T05:20:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.