Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning
- URL: http://arxiv.org/abs/2409.00143v2
- Date: Wed, 11 Sep 2024 04:44:06 GMT
- Title: Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning
- Authors: Guoyang Xu, Junqi Xue, Yuxin Liu, Zirui Wang, Min Zhang, Zhenxi Song, Zhiguo Zhang
- Abstract summary: Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
We propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics.
- Score: 22.54577327204281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions. However, existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics, thus enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a semantic-guided fusion module. By evaluating the correlations between different modalities, this module facilitates cross-modal interactions gated by modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model. Our code is available at https://github.com/X-G-Y/SATI.
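Note: the abstract gives only a high-level description of the temporal-invariant constraint and the semantic-guided, gated fusion; the released code at https://github.com/X-G-Y/SATI is authoritative. As an illustration only, the sketch below shows one plausible PyTorch reading of the two ideas: a loss that penalizes distributional shifts between adjacent time steps of frame-level features, and a gate driven by a modality-invariant text vector. All names, the diagonal-Gaussian statistics, and the symmetric-KL choice are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def temporal_invariant_loss(h: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative temporal-invariance penalty (not the paper's released code).

    h: frame-level features of shape (batch, time, dim). Each time step is
    summarized by a diagonal Gaussian over the batch, and adjacent steps are
    pulled together with a symmetric KL term so the per-step distribution
    stays stable across the sequence.
    """
    mu = h.mean(dim=0)                            # (time, dim)
    var = h.var(dim=0, unbiased=False) + eps      # (time, dim)
    mu1, mu2, v1, v2 = mu[:-1], mu[1:], var[:-1], var[1:]

    def kl(m1, s1, m2, s2):
        # KL between two diagonal Gaussians, elementwise over feature dims.
        return 0.5 * (s1 / s2 + (m2 - m1) ** 2 / s2 - 1 + torch.log(s2 / s1))

    return (kl(mu1, v1, mu2, v2) + kl(mu2, v2, mu1, v1)).mean()


class SemanticGuidedGate(nn.Module):
    """Hypothetical reading of "cross-modal interactions gated by
    modality-invariant representations": a text-derived, modality-invariant
    vector produces a sigmoid gate that modulates another modality's features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_inv: torch.Tensor, modality_feat: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([text_inv, modality_feat], dim=-1)))
        return g * modality_feat + (1 - g) * text_inv
```

In training, such an auxiliary loss would typically be added to the sentiment prediction objective with a weighting coefficient; the actual weighting, architecture, and loss form are specified in the repository, not here.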
Related papers
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities under any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art performance in the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- From Orthogonality to Dependency: Learning Disentangled Representation for Multi-Modal Time-Series Sensing Signals [27.95734153126108]
Existing methods for representation learning aim to disentangle the modality-shared and modality-specific latent variables.
We propose a general generation process, where the modality-shared and modality-specific latent variables are dependent.
Our MATE model is built on a temporally variational inference architecture with modality-shared and modality-specific prior networks.
arXiv Detail & Related papers (2024-05-25T06:26:02Z)
- FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space [7.324708513042455]
This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals.
It consistently outperforms state-of-the-art baselines on downstream tasks by a clear margin.
arXiv Detail & Related papers (2023-10-30T22:55:29Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Interpretable Time-series Representation Learning With Multi-Level Disentanglement [56.38489708031278]
Disentangle Time Series (DTS) is a novel disentanglement enhancement framework for sequential data.
DTS generates hierarchical semantic concepts as the interpretable and disentangled representation of time-series.
DTS achieves superior performance in downstream applications, with high interpretability of semantic concepts.
arXiv Detail & Related papers (2021-05-17T22:02:24Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
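Both the proposed model (its modality discriminator) and several of the works above (e.g., the double-discriminator strategy for modality-exclusive and -agnostic representations) rely on adversarially trained discriminators to disentangle modality-invariant from modality-specific subspaces. The following is a minimal, hypothetical sketch of that common pattern using a gradient-reversal layer; it is not taken from any of the papers' code, and the layer sizes and modality count are assumptions.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward
    pass - the standard trick for adversarial feature learning."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ModalityDiscriminator(nn.Module):
    """Classifies which modality (e.g., text / audio / visual) a representation
    came from. Training the shared encoder through the gradient-reversal layer
    pushes the modality-invariant subspace toward features the discriminator
    cannot distinguish, while modality-specific branches are trained without
    reversal."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_modalities),
        )

    def forward(self, feat: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
        return self.net(GradReverse.apply(feat, lambd))
```

The discriminator is trained with cross-entropy against modality-index labels; because gradients are reversed before reaching the encoder, the encoder is simultaneously pushed to produce features the discriminator cannot tell apart.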