AM^2-EmoJE: Adaptive Missing-Modality Emotion Recognition in
Conversation via Joint Embedding Learning
- URL: http://arxiv.org/abs/2402.10921v1
- Date: Fri, 26 Jan 2024 19:57:26 GMT
- Title: AM^2-EmoJE: Adaptive Missing-Modality Emotion Recognition in
Conversation via Joint Embedding Learning
- Authors: Naresh Kumar Devulapally, Sidharth Anand, Sreyasee Das Bhattacharjee,
Junsong Yuan
- Abstract summary: We propose AM^2-EmoJE, a model for Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning.
By leveraging the spatio-temporal details at the dialogue level, the proposed AM^2-EmoJE demonstrates superior performance compared to the best-performing state-of-the-art multimodal methods.
- Score: 42.69642087199678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human emotion can be expressed in different modes, i.e., audio, video, and
text. However, the contribution of each mode in exhibiting each emotion is not
uniform. Furthermore, complete mode-specific details may not always be available
at test time. In this work, we propose AM^2-EmoJE, a model for Adaptive
Missing-Modality Emotion Recognition in Conversation via Joint Embedding
Learning, which is grounded in two contributions. First, a query-adaptive fusion
module automatically learns the relative importance of the mode-specific
representations in a query-specific manner. In this way, the model prioritizes
the mode-invariant spatial query details of the emotion patterns while retaining
their mode-exclusive aspects within the learned multimodal query descriptor.
Second, a multimodal joint embedding learning module explicitly addresses
various missing-modality scenarios at test time. The model thereby learns to
emphasize the correlated patterns across modalities, which helps align the
cross-attended mode-specific descriptors pairwise within a joint embedding space
and thus compensate for missing modalities during inference. By leveraging the
spatio-temporal details at the dialogue level, the proposed AM^2-EmoJE not only
demonstrates superior performance compared to the best-performing
state-of-the-art multimodal methods but also, by effectively leveraging body
language in place of facial expression, offers an enhanced privacy feature. The
proposed multimodal joint embedding module yields an improvement of around 2-5%
in weighted F1 score across a variety of missing-modality query scenarios at
test time.
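
To make the two contributions above concrete, the following is a minimal PyTorch-style sketch of query-adaptive fusion and pairwise joint-embedding alignment as described in the abstract. It is not the authors' implementation; the module names, dimensions, masking scheme, and the cosine-based alignment loss are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAdaptiveFusion(nn.Module):
    """Sketch: fuse per-modality descriptors with query-specific learned weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # scores each modality descriptor
        self.proj = nn.Linear(dim, dim)  # shared projection into the joint space

    def forward(self, descriptors: torch.Tensor, mask: torch.Tensor):
        # descriptors: (batch, num_modalities, dim)
        # mask:        (batch, num_modalities), 1 = modality present, 0 = missing
        # Assumes at least one modality is present per query.
        scores = self.score(descriptors).squeeze(-1)           # (batch, M)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # ignore missing modes
        weights = torch.softmax(scores, dim=-1)                # query-specific weights
        fused = (weights.unsqueeze(-1) * descriptors).sum(dim=1)
        return self.proj(fused), self.proj(descriptors)

def pairwise_alignment_loss(mode_embeds: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Pull embeddings of co-present modalities for the same query together, so that
    # an available modality can stand in for a missing one at inference.
    _, num_modalities, _ = mode_embeds.shape
    loss = mode_embeds.new_zeros(())
    pairs = 0
    for i in range(num_modalities):
        for j in range(i + 1, num_modalities):
            both = (mask[:, i] * mask[:, j]).float()           # 1 where both modes exist
            cos = F.cosine_similarity(mode_embeds[:, i], mode_embeds[:, j], dim=-1)
            loss = loss + ((1.0 - cos) * both).sum() / both.sum().clamp(min=1.0)
            pairs += 1
    return loss / max(pairs, 1)

# Hypothetical usage: 4 queries, 3 modalities (audio, video, text), 256-dim descriptors.
fusion = QueryAdaptiveFusion(dim=256)
descs = torch.randn(4, 3, 256)
mask = torch.tensor([[1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0]])
query_descriptor, mode_embeds = fusion(descs, mask)
align_loss = pairwise_alignment_loss(mode_embeds, mask)

In this sketch, the modality mask zeroes out missing streams at fusion time, while the alignment loss is computed only over modality pairs that are both present during training, mirroring the idea of letting correlated, jointly embedded modalities compensate for absent ones at test time.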
Related papers
- Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning [22.54577327204281]
Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
We propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics.
arXiv Detail & Related papers (2024-08-30T03:28:40Z) - Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.