ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document
Understanding
- URL: http://arxiv.org/abs/2209.08569v1
- Date: Sun, 18 Sep 2022 13:46:56 GMT
- Title: ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document
Understanding
- Authors: Wenjin Wang, Zhengjie Huang, Bin Luo, Qianglong Chen, Qiming Peng,
Yinxu Pan, Weichong Yin, Shikun Feng, Yu Sun, Dianhai Yu, Yin Zhang
- Abstract summary: Existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements.
In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics.
Our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters.
- Score: 31.227481709446746
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent efforts of multimodal Transformers have improved Visually Rich
Document Understanding (VrDU) tasks via incorporating visual and textual
information. However, existing approaches mainly focus on fine-grained elements
such as words and document image patches, making it hard for them to learn from
coarse-grained elements, including natural lexical units like phrases and
salient visual regions like prominent image regions. In this paper, we attach
more importance to coarse-grained elements containing high-density information
and consistent semantics, which are valuable for document understanding. At
first, a document graph is proposed to model complex relationships among
multi-grained multimodal elements, in which salient visual regions are detected
by a cluster-based method. Then, a multi-grained multimodal Transformer called
mmLayout is proposed to incorporate coarse-grained information into existing
pre-trained fine-grained multimodal Transformers based on the graph. In
mmLayout, coarse-grained information is aggregated from fine-grained elements and,
after further processing, fused back into the fine-grained representations for final prediction.
Furthermore, common sense enhancement is introduced to exploit the semantic
information of natural lexical units. Experimental results on four tasks,
including information extraction and document question answering, show that our
method can improve the performance of multimodal Transformers based on
fine-grained elements and achieve better performance with fewer parameters.
Qualitative analyses show that our method can capture consistent semantics in
coarse-grained elements.
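To make the aggregate-then-fuse-back pattern in the abstract concrete, here is a minimal, illustrative sketch, not the authors' implementation. It assumes DBSCAN over word-box centers as one plausible "cluster-based" grouping, mean pooling as the coarse-from-fine aggregation, a single Transformer layer as a stand-in for the graph-based processing, and a gated residual as the fuse-back step; the names `group_fine_elements` and `CoarseFineFusion` are invented for illustration.

```python
# Minimal sketch (not the authors' code) of the idea in the abstract:
# fine-grained token features are grouped into coarse-grained elements
# (phrases / salient regions), aggregated, processed, and fused back.
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN


def group_fine_elements(boxes: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    """Cluster fine-grained elements (word boxes normalized to [0, 1]) into
    coarse-grained groups by spatial proximity. min_samples=1 ensures every
    box receives a cluster label (no noise label)."""
    centers = boxes[:, :2].add(boxes[:, 2:]).div(2).numpy()
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centers)
    return torch.as_tensor(labels)


class CoarseFineFusion(nn.Module):
    """Aggregate fine-grained features into coarse-grained ones (mean pooling),
    process them, and fuse them back into the fine-grained sequence."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.coarse_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, fine: torch.Tensor, groups: torch.Tensor) -> torch.Tensor:
        # fine: (num_tokens, dim); groups: (num_tokens,) group id per token.
        num_groups = int(groups.max()) + 1
        # Aggregate: mean of fine-grained features within each coarse element.
        coarse = torch.zeros(num_groups, fine.size(-1))
        coarse.index_add_(0, groups, fine)
        counts = torch.bincount(groups, minlength=num_groups).clamp(min=1).unsqueeze(-1)
        coarse = coarse / counts
        # "Further processing" of coarse-grained elements (stand-in for the
        # graph-based interaction described in the paper).
        coarse = self.coarse_encoder(coarse.unsqueeze(0)).squeeze(0)
        # Fuse back: each token receives its group's coarse feature via a gate.
        fused = self.gate(torch.cat([fine, coarse[groups]], dim=-1))
        return fine + fused


if __name__ == "__main__":
    boxes = torch.rand(32, 4)        # toy normalized word boxes
    feats = torch.randn(32, 768)     # toy fine-grained token features
    groups = group_fine_elements(boxes)
    print(CoarseFineFusion()(feats, groups).shape)  # torch.Size([32, 768])
```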
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Advanced ingestion process powered by LLM parsing for RAG system [0.0]
This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse document types.
The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata.
arXiv Detail & Related papers (2024-12-16T20:33:33Z)
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships among image features and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion [0.0]
The paper proposes a novel approach to summarization that tackles these challenges by utilizing the strengths of multiple sources.
The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages.
arXiv Detail & Related papers (2024-06-19T17:15:47Z)
- Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation [51.80447197290866]
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given knowledge graphs.
Existing MMKGC methods usually extract multi-modal features with pre-trained models.
We introduce a novel framework MyGO to tokenize, fuse, and augment the fine-grained multi-modal representations of entities.
arXiv Detail & Related papers (2024-04-15T05:40:41Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end framework for information extraction from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end to end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.