Improving Unimodal Inference with Multimodal Transformers
- URL: http://arxiv.org/abs/2311.10170v1
- Date: Thu, 16 Nov 2023 19:53:35 GMT
- Title: Improving Unimodal Inference with Multimodal Transformers
- Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
- Abstract summary: Our approach involves a multi-branch architecture that incorporates unimodal models with a multimodal transformer-based branch.
By co-training these branches, the stronger multimodal branch can transfer its knowledge to the weaker unimodal branches through a multi-task objective.
We evaluate our approach on tasks of dynamic hand gesture recognition based on RGB and Depth, audiovisual emotion recognition based on speech and facial video, and audio-video-text based sentiment analysis.
- Score: 88.83765002648833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an approach for improving performance of unimodal models
with multimodal training. Our approach involves a multi-branch architecture
that incorporates unimodal models with a multimodal transformer-based branch.
By co-training these branches, the stronger multimodal branch can transfer its
knowledge to the weaker unimodal branches through a multi-task objective,
thereby improving the performance of the resulting unimodal models. We evaluate
our approach on tasks of dynamic hand gesture recognition based on RGB and
Depth, audiovisual emotion recognition based on speech and facial video, and
audio-video-text based sentiment analysis. Our approach outperforms the
conventionally trained unimodal counterparts. Interestingly, we also observe
that optimization of the unimodal branches improves the multimodal branch,
compared to a similar multimodal model trained from scratch.
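The co-training idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the multi-task objective, so the code below assumes one common choice: a weighted sum of per-branch cross-entropy losses, where every branch (unimodal and multimodal) predicts the same label. The branch names, the weights, and the function names `cross_entropy` and `multitask_loss` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])


def multitask_loss(branch_logits, label, weights=None):
    """Weighted sum of cross-entropy losses, one term per branch.

    branch_logits: dict mapping a branch name to its list of class logits.
    weights: optional dict of per-branch weights (assumed; defaults to 1.0).
    Returns (total_loss, per_branch_losses).
    """
    if weights is None:
        weights = {name: 1.0 for name in branch_logits}
    per_branch = {
        name: cross_entropy(logits, label)
        for name, logits in branch_logits.items()
    }
    total = sum(weights[name] * loss for name, loss in per_branch.items())
    return total, per_branch


if __name__ == "__main__":
    # Hypothetical logits for one sample with 3 classes, true class 0.
    logits = {
        "rgb": [1.2, 0.3, -0.5],
        "depth": [0.4, 0.6, -0.1],
        "multimodal": [2.5, -0.2, -1.0],
    }
    total, parts = multitask_loss(logits, label=0)
    print(total, parts)
```

In an actual training loop, the gradient of the combined loss would update all branches jointly, which is what allows the branches to influence each other. The sketch only covers the loss computation, not the architecture or the transfer mechanism itself.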
Related papers
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models [12.397136690734865]
We propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed as MuDPT.
MuDPT extends independent multi-modal prompt tuning by learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion.
Compared with the state-of-the-art methods, MuDPT achieves better recognition and generalization ability with an apparent margin.
arXiv Detail & Related papers (2023-06-20T09:15:52Z)
- UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning [29.237813880311943]
We propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal predicting.
Experimental results with fused features on two image-text classification benchmarks show that our proposed Unimodality-Supervised MultiModal Contrastive (UniS-MMC) learning method outperforms current state-of-the-art multimodal methods.
arXiv Detail & Related papers (2023-05-16T09:18:38Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.