Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2109.01797v1
- Date: Sat, 4 Sep 2021 06:04:21 GMT
- Title: Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis
- Authors: Sijie Mai, Ying Zeng, Shuangjia Zheng, Haifeng Hu
- Abstract summary: We propose a novel framework HyCon for hybrid contrastive learning of tri-modal representation.
Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning.
Our proposed method outperforms existing works.
- Score: 18.4364234071951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The wide application of smart devices makes multimodal data readily available for many tasks. In the field of multimodal sentiment analysis (MSA), most previous works focus on exploring intra- and inter-modal interactions. However, training a network with cross-modal information (language, visual, audio) remains challenging due to the modality gap, and existing methods still cannot guarantee that intra-/inter-modal dynamics are sufficiently learned. Moreover, while learning the dynamics within each sample has drawn great attention, the learning of inter-class relationships is neglected, and the limited size of available datasets restricts the generalization ability of existing methods. To address the aforementioned issues, we propose HyCon, a novel framework for hybrid contrastive learning of tri-modal representation. Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning (hence the name hybrid contrastive learning), with which the model can fully explore cross-modal interactions, preserve inter-class relationships, and reduce the modality gap. Furthermore, a refinement term is devised to prevent the model from falling into a sub-optimal solution, and HyCon can naturally generate a large number of training pairs for better generalization, mitigating the negative effect of limited dataset size. Extensive experiments on public datasets demonstrate that our proposed method outperforms existing works.
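The following is a minimal sketch of the hybrid contrastive objective described above, written in PyTorch. It combines an inter-modal contrastive term (a sample's visual/audio vectors as positives for its language vector, other samples as negatives), an intra-modal contrastive term (same-sentiment samples as positives, which is what preserves inter-class relationships), and a positive-only semi-contrastive term that pulls a sample's three modalities together to reduce the modality gap. The InfoNCE-style formulation, function names, tensor shapes, and positive-sampling scheme are illustrative assumptions; the paper's exact losses, its refinement term, and its pair-generation strategy are not reproduced here.
```python
# Illustrative sketch only: not the authors' implementation.
import torch
import torch.nn.functional as F

def contrastive(anchor, positive, bank, temperature=0.1, mask_self=False):
    """InfoNCE-style loss: positive[i] is the positive for anchor[i];
    every row of `bank` acts as a candidate negative."""
    a = F.normalize(anchor, dim=-1)                      # (B, d)
    p = F.normalize(positive, dim=-1)                    # (B, d)
    b = F.normalize(bank, dim=-1)                        # (B, d)
    pos = (a * p).sum(-1, keepdim=True) / temperature    # (B, 1)
    neg = a @ b.t() / temperature                        # (B, B)
    if mask_self:  # drop anchor-vs-itself similarities from the negatives
        eye = torch.eye(a.size(0), dtype=torch.bool, device=a.device)
        neg = neg.masked_fill(eye, float("-inf"))
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, target)               # positive sits in column 0

def hybrid_loss(z_l, z_v, z_a, labels, temperature=0.1):
    """z_l, z_v, z_a: (B, d) language/visual/audio vectors; labels: (B,) sentiment classes."""
    # Inter-modal: matching modalities of the same sample are positives,
    # other samples' modalities serve as negatives (cross-modal interaction).
    inter = contrastive(z_l, z_v, z_v, temperature) + contrastive(z_l, z_a, z_a, temperature)
    # Intra-modal: within the language modality, sample a positive that shares the
    # anchor's sentiment label (for simplicity, the anchor itself may be drawn).
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # (B, B)
    idx = torch.multinomial(same + 1e-6, 1).squeeze(1)            # one positive per anchor
    intra = contrastive(z_l, z_l[idx], z_l, temperature, mask_self=True)
    # Semi-contrastive: positives only, pulling each sample's modalities together.
    zl, zv, za = (F.normalize(z, dim=-1) for z in (z_l, z_v, z_a))
    semi = (2.0 - (zl * zv).sum(-1) - (zl * za).sum(-1)).mean()
    return inter + intra + semi
```
In practice one would also use visual and audio anchors, weight the three terms, and combine this loss with the task loss; those choices are left to the paper.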
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Detached and Interactive Multimodal Learning [17.843121072628477]
This paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities.
It addresses competition by separately training each modality encoder with isolated learning objectives.
Experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method.
arXiv Detail & Related papers (2024-07-28T15:38:58Z)
- Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs).
Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations.
We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning [35.88753097105914]
We propose the UNIMO-3 model, which can simultaneously learn in-layer and cross-layer multimodal interactions.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves its multimodal representation ability.
arXiv Detail & Related papers (2023-05-23T05:11:34Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not only on the commonality between modalities but also on the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately separate related samples from unrelated ones, making it possible to exploit the plentiful unlabeled, unpaired multimodal data (a rough sketch of this related-vs-unrelated objective is given after this entry).
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
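As a rough illustration of the related-vs-unrelated contrastive idea summarized in the last entry, the sketch below builds "unrelated" pairs by shuffling one modality within a batch and trains a small relatedness head to separate matched from mismatched pairs. How the original framework couples this signal with a multimodal generative model is not shown; the class names, shapes, and binary objective are assumptions for illustration only.
```python
# Illustrative sketch only: generic related-vs-unrelated pair discrimination.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelatednessHead(nn.Module):
    """Scores whether two modality representations come from the same sample."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z_a, z_b):
        return self.mlp(torch.cat([z_a, z_b], dim=-1)).squeeze(-1)  # (B,) logits

def relatedness_loss(head, z_img, z_txt):
    """z_img, z_txt: (B, d) encodings of paired (related) image/text samples."""
    B = z_img.size(0)
    # Shuffle one modality to create "unrelated" pairs (a few shuffled indices may
    # coincidentally still match; ignored in this simplified sketch).
    perm = torch.randperm(B, device=z_img.device)
    pos = head(z_img, z_txt)                            # related (paired) samples
    neg = head(z_img, z_txt[perm])                      # mismatched samples
    logits = torch.cat([pos, neg])
    target = torch.cat([torch.ones(B), torch.zeros(B)]).to(logits.device)
    return F.binary_cross_entropy_with_logits(logits, target)
```
In the cited paper this kind of relatedness signal is used to train a multimodal generative model on plentiful unpaired data; here it is shown only as a standalone contrastive objective.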