Multimodal Transformer With a Low-Computational-Cost Guarantee
- URL: http://arxiv.org/abs/2402.15096v1
- Date: Fri, 23 Feb 2024 05:09:35 GMT
- Title: Multimodal Transformer With a Low-Computational-Cost Guarantee
- Authors: Sungjin Park and Edward Choi
- Abstract summary: Low-Cost Multimodal Transformer (LoCoMT) is a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss.
LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants.
- Score: 14.852842445363505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have significantly improved performance across a
range of multimodal understanding tasks, such as visual question answering and
action recognition. However, multimodal Transformers significantly suffer from
a quadratic complexity of the multi-head attention with the input sequence
length, especially as the number of modalities increases. To address this, we
introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal
attention mechanism that aims to reduce computational cost during training and
inference with minimal performance loss. Specifically, by assigning different
multimodal attention patterns to each attention head, LoCoMT can flexibly
control multimodal signals and theoretically ensures a reduced computational
cost compared to existing multimodal Transformer variants. Experimental results
on two multimodal datasets, namely Audioset and MedVidCL demonstrate that
LoCoMT not only reduces GFLOPs but also matches or even outperforms established
models.
Related papers
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for
Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Sparse Fusion for Multimodal Transformers [7.98117428941095]
We present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers.
Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling.
State-of-the-art performance is obtained on multiple benchmarks under similar experiment conditions, while reporting up to six-fold reduction in computational cost and memory requirements.
arXiv Detail & Related papers (2021-11-23T16:43:49Z) - MM-ViT: Multi-Modal Video Transformer for Compressed Video Action
Recognition [11.573689558780764]
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-Vi) for video action recognition.
In order to handle large number of tokens extracted from multiple modalities, we develop several model variants which factorize self-attention across the space, time and modality dimensions.
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
arXiv Detail & Related papers (2021-08-20T18:05:39Z) - Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA) for short.
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z) - Dynamic Clone Transformer for Efficient Convolutional Neural Netwoks [0.0]
We introduce a novel concept termed multi-path fully connected pattern (MPFC) to rethink the interdependencies of topology pattern, accuracy and efficiency for ConvNets.
Inspired by MPFC, we propose a dual-branch module named dynamic clone transformer (DCT) where one branch generates multiple replicas from inputs and another branch reforms those clones through a series of difference vectors conditional on inputs itself to produce more variants.
arXiv Detail & Related papers (2021-06-12T13:42:28Z) - UPDeT: Universal Multi-agent Reinforcement Learning via Policy
Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z) - Multi-Unit Transformers for Neural Machine Translation [51.418245676894465]
We propose the Multi-Unit Transformers (MUTE) to promote the expressiveness of the Transformer.
Specifically, we use several parallel units and show that modeling with multiple units improves model performance and introduces diversity.
arXiv Detail & Related papers (2020-10-21T03:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.