MGDoc: Pre-training with Multi-granular Hierarchy for Document Image
Understanding
- URL: http://arxiv.org/abs/2211.14958v1
- Date: Sun, 27 Nov 2022 22:47:37 GMT
- Title: MGDoc: Pre-training with Multi-granular Hierarchy for Document Image
Understanding
- Authors: Zilong Wang, Jiuxiang Gu, Chris Tensmeyer, Nikolaos Barmpalios, Ani
Nenkova, Tong Sun, Jingbo Shang, Vlad I. Morariu
- Abstract summary: Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features from either word-level or region-level but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
- Score: 53.03978356918377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document images are a ubiquitous source of data where the text is organized
in a complex hierarchical structure ranging from fine granularity (e.g.,
words), medium granularity (e.g., regions such as paragraphs or figures), to
coarse granularity (e.g., the whole page). The spatial hierarchical
relationships between content at different levels of granularity are crucial
for document image understanding tasks. Existing methods learn features from
either word-level or region-level but fail to consider both simultaneously.
Word-level models are restricted by the fact that they originate from pure-text
language models, which only encode the word-level context. In contrast,
region-level models attempt to encode regions corresponding to paragraphs or
text blocks into a single embedding, but they perform worse with additional
word-level features. To deal with these issues, we propose MGDoc, a new
multi-modal multi-granular pre-training framework that encodes page-level,
region-level, and word-level information at the same time. MGDoc uses a unified
text-visual encoder to obtain multi-modal features across different
granularities, which makes it possible to project the multi-granular features
into the same hyperspace. To model the region-word correlation, we design a
cross-granular attention mechanism and specific pre-training tasks that
reinforce the model's learning of the hierarchy between regions and
words. Experiments demonstrate that our proposed model can learn better
features that perform well across granularities and lead to improvements in
downstream tasks.
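The cross-granular attention described in the abstract can be illustrated with a minimal sketch. This is a simplification, not the paper's implementation: it assumes word and region embeddings have already been projected into the shared text-visual space, and it uses a hard parent mask (each word attends only to its own region) in place of MGDoc's learned cross-granular attention. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_granular_attention(words, regions, word_to_region):
    """Each word attends to region embeddings in the shared space.

    words:          (n_words, d) word embeddings, already projected
                    into the shared text-visual space.
    regions:        (n_regions, d) region embeddings in the same space.
    word_to_region: (n_words,) index of each word's parent region.

    A hard hierarchy mask restricts each word to its parent region,
    a deliberate simplification of the paper's learned attention.
    """
    n_words, dim = words.shape
    n_regions = regions.shape[0]
    # Scaled dot-product scores between every word and every region.
    scores = words @ regions.T / np.sqrt(dim)      # (n_words, n_regions)
    # Hierarchy mask: -inf everywhere except a word's parent region.
    mask = np.full((n_words, n_regions), -np.inf)
    mask[np.arange(n_words), word_to_region] = 0.0
    attn = softmax(scores + mask, axis=-1)         # one-hot rows here
    return attn @ regions                          # (n_words, d)
```

With the hard mask, each output row equals the parent region's embedding; relaxing the mask (or making it learned) recovers the soft region-word correlation the paper actually trains.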
Related papers
- Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification [20.434941308959786]
Long document classification presents challenges due to documents' extensive content and complex structure.
Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents.
Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts.
arXiv Detail & Related papers (2024-10-03T19:25:01Z)
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- Multi-modal Generation via Cross-Modal In-Context Learning [50.45304937804883]
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences.
Our MGCC demonstrates a diverse range of multimodal capabilities, like novel image generation, the facilitation of multimodal dialogue, and generation of texts.
arXiv Detail & Related papers (2024-05-28T15:58:31Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- Learning Multiscale Transformer Models for Sequence Generation [33.73729074207944]
We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge.
Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
arXiv Detail & Related papers (2022-06-19T07:28:54Z)
- SMDT: Selective Memory-Augmented Neural Document Translation [53.4627288890316]
We propose a Selective Memory-augmented Neural Document Translation model to deal with documents whose context spans a large hypothesis space.
We retrieve similar bilingual sentence pairs from the training corpus to augment global context.
We extend the two-stream attention model with selective mechanism to capture local context and diverse global contexts.
arXiv Detail & Related papers (2022-01-05T14:23:30Z)
- Language Through a Prism: A Spectral Approach for Multiscale Language Representations [30.224517199646993]
We show that signal processing provides a natural framework for separating structure across scales.
We apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part-of-speech tagging.
We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales.
arXiv Detail & Related papers (2020-11-09T23:17:43Z)
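The spectral filtering in the last entry above can be sketched minimally: a low-pass filter over one neuron's activations across the input sequence keeps coarse, document-scale structure and discards fine, word-scale variation. The function name and cutoff parameter are assumptions for illustration; this is a plain FFT low-pass, not the paper's prism layer.

```python
import numpy as np

def spectral_filter(activations, keep_fraction=0.25):
    """Low-pass filter a neuron's activation sequence across an input.

    activations:   (seq_len,) one neuron's value at each token position.
    keep_fraction: fraction of lowest frequencies to retain; small values
                   keep only coarse, document-scale structure, while
                   larger values also keep word-scale detail.
    """
    freqs = np.fft.rfft(activations)            # real-input FFT
    cutoff = max(1, int(len(freqs) * keep_fraction))
    freqs[cutoff:] = 0.0                        # zero out high frequencies
    return np.fft.irfft(freqs, n=len(activations))
```

A constant (purely document-scale) signal passes through unchanged, while a rapidly alternating (word-scale) signal is suppressed at small `keep_fraction`.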
This list is automatically generated from the titles and abstracts of the papers in this site.