UniS-MMC: Multimodal Classification via Unimodality-supervised
Multimodal Contrastive Learning
- URL: http://arxiv.org/abs/2305.09299v1
- Date: Tue, 16 May 2023 09:18:38 GMT
- Title: UniS-MMC: Multimodal Classification via Unimodality-supervised
Multimodal Contrastive Learning
- Authors: Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong
Chng
- Abstract summary: We propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal prediction.
Experimental results with fused features on two image-text classification benchmarks show that our proposed Unimodality-Supervised MultiModal Contrastive learning method (UniS-MMC) outperforms current state-of-the-art multimodal methods.
- Score: 29.237813880311943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning aims to imitate humans in acquiring
complementary information from multiple modalities for various downstream
tasks. However, traditional aggregation-based multimodal fusion methods ignore
the inter-modality relationship, treat each modality equally, suffer from
sensor noise, and thus degrade multimodal learning performance. In this work,
we propose a novel multimodal contrastive method to explore more reliable
multimodal representations under the weak supervision of unimodal prediction.
Specifically, we first capture task-related unimodal representations and
unimodal predictions from the introduced unimodal prediction task. The
unimodal representations are then aligned with the more effective one through
the designed multimodal contrastive method, under the supervision of the
unimodal predictions. Experimental results with fused features on two
image-text classification benchmarks, UPMC-Food-101 and N24News, show that our
proposed Unimodality-Supervised MultiModal Contrastive learning method
(UniS-MMC) outperforms current state-of-the-art multimodal methods. A detailed
ablation study and analysis further demonstrate the advantages of the proposed
method.
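The abstract gives no implementation details, so the PyTorch-style sketch below is only one plausible reading of the idea: auxiliary unimodal classifiers provide weak supervision, and, per sample, the representation of the weaker modality is pulled toward the more effective one with a contrastive objective. The function name, the confidence-based choice of the "more effective" modality, and the temperature are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def unis_mmc_loss(z_img, z_txt, logits_img, logits_txt, labels, tau=0.1):
    """Illustrative, assumption-based sketch of unimodality-supervised
    contrastive alignment (not the authors' released code).

    z_img, z_txt:           unimodal representations, shape (B, D)
    logits_img, logits_txt: predictions from the auxiliary unimodal tasks, (B, C)
    labels:                 ground-truth class labels, (B,)
    """
    # Weak supervision from the unimodal predictions: judge which modality is
    # "more effective" per sample by the probability it assigns to the true class.
    p_img = F.softmax(logits_img, dim=-1).gather(1, labels[:, None]).squeeze(1)
    p_txt = F.softmax(logits_txt, dim=-1).gather(1, labels[:, None]).squeeze(1)

    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)

    # Cross-modal similarity matrix; matched image-text pairs are positives.
    sim = z_img @ z_txt.t() / tau                                    # (B, B)
    targets = torch.arange(z_img.size(0), device=sim.device)
    loss_i2t = F.cross_entropy(sim, targets, reduction="none")       # image -> text
    loss_t2i = F.cross_entropy(sim.t(), targets, reduction="none")   # text -> image

    # Pull the weaker modality toward the stronger one: when the text prediction
    # is better, align the image representation to the text anchor, and vice versa.
    txt_better = (p_txt >= p_img).float()
    contrastive = (txt_better * loss_i2t + (1.0 - txt_better) * loss_t2i).mean()

    # Auxiliary unimodal prediction losses that provide the weak supervision.
    unimodal = F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
    return contrastive + unimodal
```

In training, such an auxiliary loss would presumably be combined with the classification loss on the fused features that produces the final multimodal prediction.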
Related papers
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
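For intuition only, a minimal sketch of what alternating unimodal optimization through a shared head could look like is given below; the encoders, dimensions, and update order are hypothetical and not taken from the MLA paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-modality setup: one encoder per modality plus a single
# shared classification head that is optimized in turn on every modality.
encoders = nn.ModuleDict({"image": nn.Linear(2048, 256), "text": nn.Linear(768, 256)})
shared_head = nn.Linear(256, 10)
optimizer = torch.optim.Adam(
    list(encoders.parameters()) + list(shared_head.parameters()), lr=1e-4
)


def alternating_step(batch, labels):
    """One alternating pass: each modality separately updates its own encoder
    and the shared head, so cross-modal interaction happens only via the head."""
    for name, x in batch.items():  # e.g. {"image": (B, 2048), "text": (B, 768)}
        logits = shared_head(encoders[name](x))
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```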
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Improving Unimodal Inference with Multimodal Transformers [88.83765002648833]
Our approach involves a multi-branch architecture that incorporates unimodal models with a multimodal transformer-based branch.
By co-training these branches, the stronger multimodal branch can transfer its knowledge to the weaker unimodal branches through a multi-task objective.
We evaluate our approach on dynamic hand gesture recognition based on RGB and depth, audiovisual emotion recognition based on speech and facial video, and audio-video-text based sentiment analysis.
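As a rough illustration of such a multi-task co-training objective (not the paper's exact loss), one common formulation adds a distillation term that transfers knowledge from the multimodal branch to each unimodal branch; all names and weights below are assumptions.

```python
import torch.nn.functional as F


def cotraining_loss(mm_logits, uni_logits_list, labels, alpha=0.5, tau=2.0):
    """Hypothetical multi-task objective: supervised losses for every branch plus
    a distillation term from the multimodal branch to each unimodal branch."""
    loss = F.cross_entropy(mm_logits, labels)
    teacher = F.softmax(mm_logits.detach() / tau, dim=-1)
    for uni_logits in uni_logits_list:
        loss = loss + F.cross_entropy(uni_logits, labels)
        student = F.log_softmax(uni_logits / tau, dim=-1)
        loss = loss + alpha * F.kl_div(student, teacher, reduction="batchmean") * tau**2
    return loss
```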
arXiv Detail & Related papers (2023-11-16T19:53:35Z)
- Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization [2.4660652494309936]
Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI, a self-supervised learning approach.
arXiv Detail & Related papers (2023-11-07T08:10:36Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts [30.15646658460899]
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media.
Existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect.
We propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario.
arXiv Detail & Related papers (2022-11-12T08:10:35Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)