Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2210.14556v1
- Date: Wed, 26 Oct 2022 08:24:15 GMT
- Title: Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis
- Authors: Ronghao Lin, Haifeng Hu
- Abstract summary: We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment.
- Score: 19.07020276666615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal representation learning is a challenging task in which previous
work mostly focuses on either uni-modality pre-training or cross-modality fusion.
In fact, we regard modeling multimodal representation as building a skyscraper,
where laying a stable foundation and designing the main structure are equally
essential. The former is like encoding robust uni-modal representations, while
the latter is like integrating interactive information among different
modalities, both of which are critical to learning an effective multimodal
representation. Recently, contrastive learning has been successfully applied in
representation learning, which can be utilized as the pillar of the skyscraper
and help the model extract the most important features contained in the
multimodal data. In this paper, we propose a novel framework named MultiModal
Contrastive Learning (MMCL) for multimodal representation to capture intra- and
inter-modality dynamics simultaneously. Specifically, we devise uni-modal
contrastive coding with an efficient uni-modal feature augmentation strategy to
filter the inherent noise in the acoustic and visual modalities and acquire
more robust uni-modality representations. Besides, a pseudo-siamese network is
presented to predict representations across different modalities, which
successfully captures cross-modal dynamics. Moreover, we design two contrastive
learning tasks, instance- and sentiment-based contrastive learning, to promote
the prediction process and learn more interactive information related to
sentiment. Extensive experiments conducted on two public datasets demonstrate
that our method surpasses the state-of-the-art methods.
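To make the two contrastive objectives above concrete, the following is a minimal PyTorch-style sketch: an instance-based InfoNCE loss between cross-modal predictions and target-modality features, and a sentiment-based supervised contrastive loss that pulls together samples sharing a sentiment class. This is an illustration under assumptions (L2 normalization, a temperature of 0.07, and sentiment scores binned into discrete classes), not the authors' released implementation.

# Illustrative sketch of the two contrastive objectives described above.
# NOT the authors' implementation: normalization, temperature, and the way
# sentiment scores are binned into classes are assumptions made here.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(pred, target, temperature=0.07):
    """InfoNCE between cross-modal predictions and target-modality features.

    pred, target: (batch, dim) tensors; matching rows are positive pairs,
    all other rows in the batch serve as in-batch negatives.
    """
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

def sentiment_contrastive_loss(features, sentiment_labels, temperature=0.07):
    """Supervised contrastive loss: samples sharing a sentiment class attract.

    features: (batch, dim); sentiment_labels: (batch,) discrete class ids,
    e.g. sentiment scores binned into negative / neutral / positive.
    """
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t() / temperature
    batch = feats.size(0)
    eye = torch.eye(batch, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(eye, float('-inf'))         # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (sentiment_labels.unsqueeze(0) == sentiment_labels.unsqueeze(1)) & ~eye
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    return loss.mean()

In a full MMCL-style pipeline these losses would be added to the main sentiment prediction objective; the relative weighting is a further assumption not given in the abstract.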
Related papers
- On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning.
We identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning.
Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z) - Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training.
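A minimal sketch of this two-pass idea follows, assuming simple linear encoders with dropout and an InfoNCE objective; the encoder shapes, dropout rate, and the way the in-modal and cross-modal terms are summed are assumptions, not the paper's exact setup.

# Two forward passes with different dropout masks -> two views per modality
# (in the spirit of SimCSE-style augmentation). Encoders and loss weights
# here are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

text_enc = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Dropout(0.1))
image_enc = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(), nn.Dropout(0.1))
text_enc.train(); image_enc.train()                   # keep dropout active

text_x, image_x = torch.randn(32, 300), torch.randn(32, 2048)

# Same inputs, two passes -> two dropout-perturbed representations per modality.
t1, t2 = text_enc(text_x), text_enc(text_x)
v1, v2 = image_enc(image_x), image_enc(image_x)

in_modal = info_nce(t1, t2) + info_nce(v1, v2)        # same modality, two views
cross_modal = info_nce(t1, v1) + info_nce(t2, v2)     # paired modalities
loss = in_modal + cross_modal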
arXiv Detail & Related papers (2024-09-14T03:15:34Z) - Improving Unimodal Inference with Multimodal Transformers [88.83765002648833]
Our approach involves a multi-branch architecture that incorporates unimodal models with a multimodal transformer-based branch.
By co-training these branches, the stronger multimodal branch can transfer its knowledge to the weaker unimodal branches through a multi-task objective.
We evaluate our approach on dynamic hand gesture recognition based on RGB and depth, audiovisual emotion recognition based on speech and facial video, and audio-video-text sentiment analysis.
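A hedged sketch of what such co-training could look like, assuming a KL-based distillation term that pulls the unimodal branch toward the (detached) multimodal predictions; the branch definitions and loss weighting are illustrative assumptions, not the paper's exact multi-task objective.

# Co-training a unimodal branch with a stronger multimodal branch.
# The KL distillation term is an assumption for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 7
rgb_branch = nn.Linear(512, num_classes)      # unimodal head (stand-in)
fused_branch = nn.Linear(1024, num_classes)   # multimodal transformer head (stand-in)

rgb_feat = torch.randn(16, 512)
fused_feat = torch.randn(16, 1024)            # e.g. fused RGB + depth features
labels = torch.randint(0, num_classes, (16,))

uni_logits = rgb_branch(rgb_feat)
multi_logits = fused_branch(fused_feat)

# Multi-task objective: both branches are supervised, and the unimodal branch
# is additionally pulled toward the detached multimodal predictions.
task_loss = F.cross_entropy(uni_logits, labels) + F.cross_entropy(multi_logits, labels)
transfer_loss = F.kl_div(
    F.log_softmax(uni_logits, dim=-1),
    F.softmax(multi_logits.detach(), dim=-1),
    reduction="batchmean",
)
loss = task_loss + transfer_loss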
arXiv Detail & Related papers (2023-11-16T19:53:35Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z) - Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model has a strong capability of modeling the interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)