ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document
Understanding
- URL: http://arxiv.org/abs/2209.08569v1
- Date: Sun, 18 Sep 2022 13:46:56 GMT
- Title: ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document
Understanding
- Authors: Wenjin Wang, Zhengjie Huang, Bin Luo, Qianglong Chen, Qiming Peng,
Yinxu Pan, Weichong Yin, Shikun Feng, Yu Sun, Dianhai Yu, Yin Zhang
- Abstract summary: Existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements.
In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics.
Our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters.
- Score: 31.227481709446746
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent efforts of multimodal Transformers have improved Visually Rich
Document Understanding (VrDU) tasks via incorporating visual and textual
information. However, existing approaches mainly focus on fine-grained elements
such as words and document image patches, making it hard for them to learn from
coarse-grained elements, including natural lexical units like phrases and
salient visual regions like prominent image regions. In this paper, we attach
more importance to coarse-grained elements containing high-density information
and consistent semantics, which are valuable for document understanding. At
first, a document graph is proposed to model complex relationships among
multi-grained multimodal elements, in which salient visual regions are detected
by a cluster-based method. Then, a multi-grained multimodal Transformer called
mmLayout is proposed to incorporate coarse-grained information into existing
pre-trained fine-grained multimodal Transformers based on the graph. In
mmLayout, coarse-grained information is aggregated from fine-grained elements and
then, after further processing, fused back into them for final prediction.
Furthermore, common sense enhancement is introduced to exploit the semantic
information of natural lexical units. Experimental results on four tasks,
including information extraction and document question answering, show that our
method can improve the performance of multimodal Transformers based on
fine-grained elements and achieve better performance with fewer parameters.
Qualitative analyses show that our method can capture consistent semantics in
coarse-grained elements.
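The aggregate-then-fuse pattern described in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration only, not the authors' implementation: it assumes mean-pooling aggregation, a single Transformer layer over coarse elements, concatenation-based fusion, and DBSCAN over box centers as a stand-in for the cluster-based detection of salient visual regions; the paper's actual document-graph construction and attention design are not reproduced here.

```python
# Minimal sketch of the coarse<->fine aggregate-and-fuse idea from the abstract.
# Pooling, clustering, and fusion choices are illustrative assumptions only.
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN


class CoarseFineFusion(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 8):
        super().__init__()
        # Coarse-grained elements interact through their own Transformer layer.
        self.coarse_encoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_feats: torch.Tensor, assignment: torch.Tensor) -> torch.Tensor:
        """token_feats: (num_tokens, hidden) fine-grained features from a
        pre-trained multimodal Transformer; assignment: (num_tokens,) LongTensor
        giving the coarse element (phrase / visual region) each token belongs to."""
        num_coarse = int(assignment.max().item()) + 1
        hidden = token_feats.size(-1)

        # 1) Aggregate: mean-pool fine-grained features into coarse elements.
        coarse = token_feats.new_zeros(num_coarse, hidden)
        counts = token_feats.new_zeros(num_coarse, 1)
        coarse.index_add_(0, assignment, token_feats)
        counts.index_add_(0, assignment, token_feats.new_ones(len(assignment), 1))
        coarse = coarse / counts.clamp(min=1)

        # 2) Process: let coarse elements attend to each other.
        coarse = self.coarse_encoder(coarse.unsqueeze(0)).squeeze(0)

        # 3) Fuse back: combine each token with its coarse element's feature
        #    before the final prediction head.
        return self.fuse(torch.cat([token_feats, coarse[assignment]], dim=-1))


def salient_regions_from_boxes(centers, eps: float = 30.0, min_samples: int = 2):
    """Group word-box centers (N, 2) into candidate salient regions by spatial
    clustering; a stand-in for the cluster-based detection mentioned above."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    return labels  # -1 marks boxes not assigned to any region
```

In this sketch, any backbone that yields per-token features could supply `token_feats`, and the `assignment` vector plays the role the document graph's token-to-coarse-element edges play in the paper.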
Related papers
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships among image features and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion [0.0]
The paper proposes a novel approach to summarization that tackles these challenges by leveraging the strengths of multiple sources.
The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages.
arXiv Detail & Related papers (2024-06-19T17:15:47Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Absformer: Transformer-based Model for Unsupervised Multi-Document Abstractive Summarization [1.066048003460524]
Multi-document summarization (MDS) refers to the task of summarizing the text in multiple documents into a concise summary.
Abstractive MDS aims to generate a coherent and fluent summary for multiple documents using natural language generation techniques.
We propose Absformer, a new Transformer-based method for unsupervised abstractive summary generation.
arXiv Detail & Related papers (2023-06-07T21:18:23Z)
- MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features at either the word level or the region level but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations [12.394777121890925]
This paper revisits and substantially extends previous dataset creation efforts.
We show that our extended version uses more representative texts for multi-document tasks and provides a larger and more diverse training set.
arXiv Detail & Related papers (2021-10-09T09:15:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.