Multimodal Transformer With a Low-Computational-Cost Guarantee
- URL: http://arxiv.org/abs/2402.15096v1
- Date: Fri, 23 Feb 2024 05:09:35 GMT
- Title: Multimodal Transformer With a Low-Computational-Cost Guarantee
- Authors: Sungjin Park and Edward Choi
- Abstract summary: Low-Cost Multimodal Transformer (LoCoMT) is a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss.
LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants.
- Score: 14.852842445363505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have significantly improved performance across a
range of multimodal understanding tasks, such as visual question answering and
action recognition. However, multimodal Transformers suffer significantly from
the quadratic complexity of multi-head attention with respect to the input sequence
length, especially as the number of modalities increases. To address this, we
introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal
attention mechanism that aims to reduce computational cost during training and
inference with minimal performance loss. Specifically, by assigning different
multimodal attention patterns to each attention head, LoCoMT can flexibly
control multimodal signals and theoretically ensures a reduced computational
cost compared to existing multimodal Transformer variants. Experimental results
on two multimodal datasets, namely AudioSet and MedVidCL, demonstrate that
LoCoMT not only reduces GFLOPs but also matches or even outperforms established
models.
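To make the per-head attention-pattern idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of modality-restricted attention heads: each head is given a fixed set of (query-modality, key-modality) pairs it may attend over. The class name ModalityRestrictedAttention and the head_patterns / modality_ids arguments are illustrative assumptions, and for clarity the sketch materializes a full attention mask per head, so it shows the restriction pattern rather than the FLOP savings, which in practice come from computing only the permitted modality blocks.
```python
# Sketch only: per-head modality-restricted attention, assuming tokens from all
# modalities are concatenated into one sequence and tagged with modality_ids.
import torch
import torch.nn as nn


class ModalityRestrictedAttention(nn.Module):
    def __init__(self, dim, num_heads, head_patterns):
        # head_patterns: list of length num_heads; each element is a set of
        # (query_modality, key_modality) index pairs the head may attend over.
        super().__init__()
        assert dim % num_heads == 0 and len(head_patterns) == num_heads
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.head_patterns = head_patterns

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, dim); modality_ids: (seq_len,) long tensor giving
        # the modality index of each token in the concatenated sequence.
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Build one (N, N) boolean mask per head from its allowed modality pairs.
        masks = []
        for pattern in self.head_patterns:
            # Always keep self-attention so no query row is fully masked out.
            allowed = torch.eye(N, dtype=torch.bool, device=x.device)
            for qm, km in pattern:
                allowed |= (modality_ids[:, None] == qm) & (modality_ids[None, :] == km)
            masks.append(allowed)
        mask = torch.stack(masks)  # (num_heads, N, N)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.masked_fill(~mask[None], float('-inf')).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```
As a toy configuration with two modalities (say 0 = audio, 1 = video), most heads could be assigned within-modality patterns such as {(0, 0), (1, 1)}, with only a few heads given the full cross-modal pattern {(0, 0), (0, 1), (1, 0), (1, 1)}; how patterns are actually assigned in LoCoMT is specified in the paper, not in this sketch.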
Related papers
- HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding [67.24430397016275]
We propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner.
The proposed model demonstrates superior performance compared to other single-transformer LMMs and significantly narrows the performance gap with compositional LMMs.
arXiv Detail & Related papers (2025-03-12T06:01:05Z)
- Multiset Transformer: Advancing Representation Learning in Persistence Diagrams [11.512742322405906]
Multiset Transformer is a neural network that utilizes attention mechanisms specifically designed for multisets as inputs.
The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers.
Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.
arXiv Detail & Related papers (2024-11-22T01:38:47Z)
- GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis [0.0]
Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze human sentiment.
Existing MSA models generally employ cutting-edge multimodal fusion and representation-learning methods to improve MSA capability.
Our proposed GSIFN incorporates two main components to solve these problems: (i) a graph-structured and interlaced-masked multimodal Transformer.
It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computational overhead.
arXiv Detail & Related papers (2024-08-27T06:44:28Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Sparse Fusion for Multimodal Transformers [7.98117428941095]
We present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers.
Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling.
State-of-the-art performance is obtained on multiple benchmarks under similar experimental conditions, with up to a six-fold reduction in computational cost and memory requirements.
arXiv Detail & Related papers (2021-11-23T16:43:49Z)
- MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [11.573689558780764]
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
In order to handle the large number of tokens extracted from multiple modalities, we develop several model variants that factorize self-attention across the space, time, and modality dimensions.
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
arXiv Detail & Related papers (2021-08-20T18:05:39Z)
- Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure, Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named the Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
- Multi-Unit Transformers for Neural Machine Translation [51.418245676894465]
We propose the Multi-Unit Transformers (MUTE) to promote the expressiveness of the Transformer.
Specifically, we use several parallel units and show that modeling with multiple units improves model performance and introduces diversity.
arXiv Detail & Related papers (2020-10-21T03:41:49Z)