Leveraging Label Information for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2309.02106v1
- Date: Tue, 5 Sep 2023 10:26:32 GMT
- Title: Leveraging Label Information for Multimodal Emotion Recognition
- Authors: Peiying Wang, Sunlu Zeng, Junqing Chen, Lu Fan, Meng Chen, Youzheng Wu, Xiaodong He
- Abstract summary: Multimodal emotion recognition (MER) aims to detect the emotional status of a given expression by combining the speech and text information.
We propose a novel approach for MER by leveraging label information.
We devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations for emotion classification.
- Score: 22.318092635089464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal emotion recognition (MER) aims to detect the emotional status of a
given expression by combining the speech and text information. Intuitively,
label information should be capable of helping the model locate the salient
tokens/frames relevant to the specific emotion, which finally facilitates the
MER task. Inspired by this, we propose a novel approach for MER by leveraging
label information. Specifically, we first obtain the representative label
embeddings for both text and speech modalities, then learn the label-enhanced
text/speech representations for each utterance via label-token and label-frame
interactions. Finally, we devise a novel label-guided attentive fusion module
to fuse the label-aware text and speech representations for emotion
classification. Extensive experiments were conducted on the public IEMOCAP
dataset, and experimental results demonstrate that our proposed approach
outperforms existing baselines and achieves new state-of-the-art performance.
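As a concrete illustration, the following is a minimal PyTorch sketch of label-token/label-frame attention followed by a label-guided fusion step. The dimensions, the scaled dot-product attention, and the gating-based fusion are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn.functional as F

def label_attend(labels: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """Attend each label embedding over token/frame features.
    labels: (num_labels, d); feats: (seq_len, d).
    Returns (num_labels, d), a label-enhanced modality representation."""
    scores = labels @ feats.t() / feats.size(-1) ** 0.5  # (num_labels, seq_len)
    return F.softmax(scores, dim=-1) @ feats

def label_guided_fusion(text_rep, speech_rep, label_emb):
    """Fuse label-aware text and speech representations; the sigmoid
    gate driven by label affinity is an assumed fusion choice."""
    t_score = (text_rep * label_emb).sum(-1, keepdim=True)
    s_score = (speech_rep * label_emb).sum(-1, keepdim=True)
    gate = torch.sigmoid(t_score - s_score)              # (num_labels, 1)
    return gate * text_rep + (1 - gate) * speech_rep

num_labels, d = 4, 256                  # e.g., a 4-way IEMOCAP setup
label_emb = torch.randn(num_labels, d)  # label embeddings (shared across modalities here)
text_feats = torch.randn(50, d)         # 50 token representations
speech_feats = torch.randn(300, d)      # 300 acoustic frame representations

text_rep = label_attend(label_emb, text_feats)      # label-token interaction
speech_rep = label_attend(label_emb, speech_feats)  # label-frame interaction
fused = label_guided_fusion(text_rep, speech_rep, label_emb)
logits = (fused * label_emb).sum(-1)    # one score per emotion class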
Related papers
- PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition [47.11517266162346]
We propose a Prompt-driven Visual-Linguistic Representation Learning framework to better leverage the capabilities of the linguistic modality.
In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features.
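A hedged PyTorch sketch of such bidirectional interaction; the symmetric cross-attention layout and dimensions are assumptions for illustration, not PVLR's exact DMA module.

import torch
import torch.nn as nn

d = 256
txt2vis = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
vis2txt = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

text = torch.randn(1, 20, d)    # prompt/label token features
visual = torch.randn(1, 49, d)  # e.g., a 7x7 grid of image patch features

# Each modality queries the other, so information flows both ways.
text_enh, _ = txt2vis(query=text, key=visual, value=visual)
vis_enh, _ = vis2txt(query=visual, key=text, value=text)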
arXiv Detail & Related papers (2024-01-31T14:39:11Z)
- HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model [62.995175485416]
We propose a new approach to enrich the semantic representation of HuBERT.
An auxiliary topic classification task is added to HuBERT by using topic labels as teachers.
Experimental results demonstrate that our method achieves comparable or better performance than the baseline in most tasks.
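A minimal sketch of attaching such an auxiliary objective to a HuBERT-style encoder; the mean-pooling, linear head, and loss weight are illustrative assumptions.

import torch
import torch.nn as nn

d, num_topics, vocab = 768, 50, 504         # vocab: pseudo-label cluster count
topic_head = nn.Linear(d, num_topics)       # auxiliary topic classifier
ce = nn.CrossEntropyLoss()

hidden = torch.randn(8, 200, d)             # encoder output (batch, frames, d)
masked_logits = torch.randn(8, 200, vocab)  # stand-in for HuBERT's masked-prediction logits
frame_targets = torch.randint(0, vocab, (8, 200))
topic_targets = torch.randint(0, num_topics, (8,))  # labels from a topic model

hubert_loss = ce(masked_logits.reshape(-1, vocab), frame_targets.reshape(-1))
topic_loss = ce(topic_head(hidden.mean(dim=1)), topic_targets)  # utterance-level
loss = hubert_loss + 0.1 * topic_loss       # 0.1 is an assumed auxiliary weight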
arXiv Detail & Related papers (2023-10-06T02:19:09Z)
- LanSER: Language-Model Supported Speech Emotion Recognition [25.597250907836152]
We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models.
For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition.
Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency.
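A hedged sketch of that weak-labeling step with an off-the-shelf NLI model through the Hugging Face zero-shot pipeline; the checkpoint, taxonomy, and hypothesis template are assumptions, not LanSER's exact setup.

from transformers import pipeline

# Zero-shot classification scores each candidate label as an entailment
# hypothesis against the premise (here, the ASR transcript).
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

taxonomy = ["angry", "happy", "sad", "neutral"]
transcript = "i cannot believe you did that again"  # from an ASR system

result = nli(transcript, candidate_labels=taxonomy,
             hypothesis_template="This person feels {}.")
weak_label = result["labels"][0]  # emotion with the highest entailment score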
arXiv Detail & Related papers (2023-09-07T19:21:08Z)
- Description-Enhanced Label Embedding Contrastive Learning for Text Classification [65.01077813330559]
We incorporate Self-Supervised Learning (SSL) into the model learning process and design a novel self-supervised Relation of Relation (R2) classification task.
We develop a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as joint optimization targets.
We exploit external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
arXiv Detail & Related papers (2023-06-15T02:19:34Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- IDEA: Interactive DoublE Attentions from Label Embedding for Text Classification [4.342189319523322]
We propose a novel model structure named IDEA, which combines a siamese BERT with interactive double attentions to capture the information exchange between text and label names.
Our proposed method significantly outperforms state-of-the-art methods that use label texts, with more stable results.
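A minimal sketch of double attention between text and label-name representations produced by a shared (siamese) encoder; the dot-product attention form is an illustrative assumption.

import torch
import torch.nn.functional as F

d = 768
text_tokens = torch.randn(40, d)   # encoder output for the input text
label_tokens = torch.randn(10, d)  # encoder output for the label names

sim = text_tokens @ label_tokens.t() / d ** 0.5        # (40, 10) affinities
text_aware = F.softmax(sim, dim=1) @ label_tokens      # label info into text
label_aware = F.softmax(sim.t(), dim=1) @ text_tokens  # text info into labels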
arXiv Detail & Related papers (2022-09-23T04:50:47Z)
- Tailor Versatile Multi-modal Learning for Multi-label Emotion Recognition [7.280460748655983]
Multi-modal Multi-label Emotion Recognition (MMER) aims to identify various human emotions from heterogeneous visual, audio and text modalities.
Previous methods mainly focus on projecting multiple modalities into a common latent space and learning an identical representation for all labels.
We propose versaTile multi-modAl learning for multI-labeL emOtion Recognition (TAILOR), aiming to refine multi-modal representations and enhance the discriminative capacity of each label.
arXiv Detail & Related papers (2022-01-15T12:02:28Z)
- Enhanced Language Representation with Label Knowledge for Span Extraction [2.4909170697740963]
We introduce a new paradigm to integrate label knowledge and propose a novel model to explicitly and efficiently integrate label knowledge into text representations.
Specifically, it encodes texts and label annotations independently and then integrates label knowledge into the text representation with an elaborately designed semantics fusion module.
We conduct extensive experiments on three typical span extraction tasks: flat NER, nested NER, and event detection.
Our method 1) achieves state-of-the-art performance on four benchmarks, and 2) reduces training and inference time by 76% and 77% on average, respectively, compared with the QA Formalization paradigm.
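One way to see the efficiency claim: label descriptions can be encoded once and cached, leaving only a light fusion step per input. The additive attention below is a hedged stand-in for the paper's semantics fusion module, not its actual design.

import torch
import torch.nn.functional as F

d = 768
text_h = torch.randn(128, d)  # token representations, encoded per input
label_h = torch.randn(9, d)   # one vector per label description, cacheable

attn = F.softmax(text_h @ label_h.t() / d ** 0.5, dim=-1)  # (128, 9)
fused = text_h + attn @ label_h  # label-knowledge-enhanced token states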
arXiv Detail & Related papers (2021-11-01T12:21:05Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction task.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
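A minimal text-only sketch of the prompt-based reformulation: classification becomes predicting a masked word in a template, scored against a small verbalizer. The template, verbalizer, and bert-base-uncased backbone are assumptions (MEmoBERT itself is multimodal and pre-trained).

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

utterance = "i am speechless, this is unbelievable"
prompt = f"{utterance} i feel [MASK]"  # downstream task as masked prediction
verbalizer = {"angry": "angry", "happy": "happy",
              "sad": "sad", "neutral": "fine"}  # class -> answer word

inputs = tok(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
logits = mlm(**inputs).logits[0, mask_pos]            # scores over the vocab
ids = [tok.convert_tokens_to_ids(w) for w in verbalizer.values()]
pred = list(verbalizer)[logits[ids].argmax().item()]  # predicted emotion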
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)