Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided
Dynamic Token Merge for Document Understanding
- URL: http://arxiv.org/abs/2305.11392v1
- Date: Fri, 19 May 2023 02:42:35 GMT
- Title: Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided
Dynamic Token Merge for Document Understanding
- Authors: Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan
Zhang, Kun Yao, Yuwei Wu, Yunde Jia
- Abstract summary: General efficient transformers are difficult to adapt directly to document modeling.
Fast-StrucTexT is an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture.
Our model achieves state-of-the-art performance with almost 1.9X faster inference than prior state-of-the-art methods.
- Score: 40.322453628755376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers achieve promising performance in document understanding
because of their high effectiveness, but they still suffer from quadratic
computational complexity with respect to sequence length. General efficient
transformers are difficult to adapt directly to document modeling: they cannot
handle the layout representations in documents, e.g., word, line, and
paragraph, at different granularity levels, and they struggle to achieve a good
trade-off between efficiency and performance. To address these concerns, we
propose Fast-StrucTexT, an efficient multi-modal framework for visual document
understanding based on the StrucTexT algorithm with an hourglass transformer
architecture. Specifically, we design a modality-guided dynamic token merging
block that lets the model learn multi-granularity representations and prune
redundant tokens. Additionally, we present a multi-modal interaction module
called Symmetry Cross Attention (SCA) to perform multi-modal fusion and
efficiently guide token merging. SCA takes one modality's input as the query
and computes cross-attention with the other modality in a dual-phase manner.
Extensive experiments on the FUNSD, SROIE, and CORD datasets demonstrate that
our model achieves state-of-the-art performance with almost 1.9X faster
inference than prior state-of-the-art methods.
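To make the dual-phase fusion concrete, below is a minimal sketch of the Symmetry Cross Attention idea in PyTorch. It assumes standard multi-head attention layers; the class name, dimensions, and residual connections are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SymmetryCrossAttention(nn.Module):
    """Illustrative dual-phase cross-attention: each modality attends to the other.

    Hedged sketch of the idea described in the abstract; layer names and
    dimensions are assumptions, not the paper's code.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Phase 1: text tokens act as the query over visual tokens.
        self.text_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Phase 2: visual tokens act as the query over text tokens.
        self.visual_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, visual: torch.Tensor):
        # text:   (batch, n_text_tokens, dim)
        # visual: (batch, n_visual_tokens, dim)
        text_fused, _ = self.text_to_visual(query=text, key=visual, value=visual)
        visual_fused, _ = self.visual_to_text(query=visual, key=text, value=text)
        # Residual connections keep each modality's original token content.
        return text + text_fused, visual + visual_fused


if __name__ == "__main__":
    sca = SymmetryCrossAttention(dim=256, num_heads=4)
    text = torch.randn(2, 128, 256)    # e.g. word/line tokens
    visual = torch.randn(2, 196, 256)  # e.g. image-patch tokens
    t, v = sca(text, visual)
    print(t.shape, v.shape)
```

Because each modality serves as the query exactly once, the two phases are symmetric; in Fast-StrucTexT the fused tokens are then used to guide the dynamic token-merging step that prunes redundant tokens.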
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN, a hybrid architecture that combines convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, and class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers [13.480259378415505]
BiXT scales linearly with input size in terms of computational cost and memory consumption.
BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module.
By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences.
arXiv Detail & Related papers (2024-02-19T13:38:15Z)
- CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers [53.224004166460254]
This paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers.
CrossGET adaptively combines tokens in real-time during inference, significantly reducing computational costs.
Experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-27T12:07:21Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
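For the Fastformer entry above, the following is a simplified single-head sketch of additive attention, which replaces the O(N^2) pairwise attention map with importance-weighted global vectors computed in O(N). Layer names, dimensions, and the residual connection are assumptions for illustration, not the released Fastformer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Simplified single-head sketch of Fastformer-style additive attention."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.q_score = nn.Linear(dim, 1)   # scalar importance per query token
        self.k_score = nn.Linear(dim, 1)   # scalar importance per key token
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Global query: importance-weighted sum over all query tokens (linear cost).
        q_weights = F.softmax(self.q_score(q), dim=1)           # (batch, seq_len, 1)
        global_q = (q_weights * q).sum(dim=1, keepdim=True)     # (batch, 1, dim)

        # Mix the global query into every key, then pool the keys the same way.
        p = k * global_q                                        # (batch, seq_len, dim)
        k_weights = F.softmax(self.k_score(p), dim=1)
        global_k = (k_weights * p).sum(dim=1, keepdim=True)     # (batch, 1, dim)

        # Broadcast the global key over the values, project, and add a residual.
        return self.out(global_k * v) + q


if __name__ == "__main__":
    attn = AdditiveAttention(dim=256)
    tokens = torch.randn(2, 512, 256)
    print(attn(tokens).shape)  # (2, 512, 256)
```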