Improving the Modality Representation with Multi-View Contrastive
Learning for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2210.15824v1
- Date: Fri, 28 Oct 2022 01:25:16 GMT
- Title: Improving the Modality Representation with Multi-View Contrastive
Learning for Multimodal Sentiment Analysis
- Authors: Peipei Liu, Xin Zheng, Hong Li, Jie Liu, Yimo Ren, Hongsong Zhu, Limin
Sun
- Abstract summary: This study investigates approaches to improving modality representations with contrastive learning.
We devise a three-stage framework with multi-view contrastive learning to refine representations for specific objectives.
We conduct experiments on three open datasets, and the results show the advantages of our model.
- Score: 15.623293264871181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modality representation learning is an important problem for multimodal
sentiment analysis (MSA), since highly distinguishable representations can
contribute to improving the analysis. Previous works on MSA have usually
focused on multimodal fusion strategies, while modality representation learning
has received less attention. Recently, contrastive learning has been confirmed
to be effective at endowing the learned representations with stronger
discriminative ability. Inspired by this, we explore approaches to improving
modality representations with contrastive learning in this study. To this end,
we devise a three-stage framework with multi-view contrastive learning to
refine representations for specific objectives. In the first stage, to improve
the unimodal representations, we employ supervised contrastive learning to pull
samples of the same class together while pushing other samples apart. In the
second stage, self-supervised contrastive learning is designed to improve the
distilled unimodal representations after cross-modal interaction. Finally, we
again leverage supervised contrastive learning to enhance the fused multimodal
representation. After all contrastive training stages, we perform the
classification task on the frozen representations. We conduct experiments on
three open datasets, and the results show the advantages of our model.
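The abstract outlines the training scheme but not its implementation, so the following is a minimal PyTorch sketch of the two loss families the three-stage framework relies on: a supervised contrastive loss for stages 1 and 3 (pulling same-class samples together while pushing others apart) and a self-supervised contrastive loss for stage 2. The function names, the positive-pair construction for stage 2, and the temperature value are illustrative assumptions rather than the authors' exact formulation.

import torch
import torch.nn.functional as F


def supervised_contrastive_loss(features, labels, temperature=0.07):
    # Supervised contrastive loss (Khosla et al., 2020): for each anchor,
    # samples with the same label are positives, all other samples negatives.
    # features: (N, D) embeddings of one modality (stage 1) or of the fused
    # multimodal vector (stage 3); labels: (N,) sentiment class ids.
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature          # (N, N) cosine logits
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, -1e9)             # drop self-comparisons
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                             # anchors with >= 1 positive
    mean_log_prob_pos = (log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()


def self_supervised_contrastive_loss(view_a, view_b, temperature=0.07):
    # NT-Xent-style loss for stage 2: row i of view_a and row i of view_b are
    # treated as a positive pair (e.g. a unimodal representation before and
    # after cross-modal interaction -- an assumption, since the paper does not
    # state its exact view construction); all other rows act as negatives.
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.T / temperature                     # (N, N)
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))


# Illustrative usage (encoder and variable names are hypothetical):
#   loss_stage1 = supervised_contrastive_loss(text_encoder(x_text), y_sentiment)
#   loss_stage2 = self_supervised_contrastive_loss(h_text, h_text_after_interaction)
#   loss_stage3 = supervised_contrastive_loss(fused_repr, y_sentiment)
# After the contrastive stages, the representations are frozen and a classifier
# head is trained on top of them for the final sentiment prediction.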
Related papers
- On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning.
We identify the signal-to-noise ratio (SNR) as the critical factor that impacts the generalizability of both multi-modal and single-modal contrastive learning in downstream tasks.
Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z) - Revealing Multimodal Contrastive Representation Learning through Latent
Partial Causal Models [85.67870425656368]
We introduce a unified causal model specifically designed for multimodal data.
We show that multimodal contrastive representation learning excels at identifying latent coupled variables.
Experiments demonstrate the robustness of our findings, even when the assumptions are violated.
arXiv Detail & Related papers (2024-02-09T07:18:06Z) - Constrained Multiview Representation for Self-supervised Contrastive
Learning [4.817827522417457]
We introduce a novel approach predicated on representation distance-based mutual information (MI) for measuring the significance of different views.
We harness multi-view representations extracted from the frequency domain, re-evaluating their significance based on mutual information.
arXiv Detail & Related papers (2024-02-05T19:09:33Z) - Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based
Contrastive Learning for Enhanced Fusion Representation [10.44888349041063]
We introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis.
This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality.
arXiv Detail & Related papers (2023-12-04T02:58:19Z) - I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal
Mutual Distillation [147.2183428328396]
We introduce a general Inter- and Intra-modal Mutual Distillation (I$^2$MD) framework.
In I$^2$MD, we first re-formulate the cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
To alleviate the interference of similar samples and exploit their underlying contexts, we further design the Intra-modal Mutual Distillation (IMD) strategy.
arXiv Detail & Related papers (2023-10-24T07:22:17Z) - Identifiability Results for Multimodal Contrastive Learning [72.15237484019174]
We show that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously.
Our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
arXiv Detail & Related papers (2023-03-16T09:14:26Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - A Broad Study on the Transferability of Visual Representations with
Contrastive Learning [15.667240680328922]
We study the transferability of learned representations of contrastive approaches for linear evaluation, full-network transfer, and few-shot recognition.
The results show that the contrastive approaches learn representations that are easily transferable to a different downstream task.
Our analysis reveals that the representations learned from the contrastive approaches contain more low/mid-level semantics than cross-entropy models.
arXiv Detail & Related papers (2021-03-24T22:55:04Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person
Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)