DWFormer: Dynamic Window transFormer for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2303.01694v1
- Date: Fri, 3 Mar 2023 03:26:53 GMT
- Title: DWFormer: Dynamic Window transFormer for Speech Emotion Recognition
- Authors: Shuaiqi Chen, Xiaofen Xing, Weibin Zhang, Weidong Chen, Xiangmin Xu
- Abstract summary: We propose Dynamic Window transFormer (DWFormer) to locate important regions at different temporal scales.
DWFormer is evaluated on both the IEMOCAP and the MELD datasets.
Experimental results show that the proposed model achieves better performance than the previous state-of-the-art methods.
- Score: 16.07391331544217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition is crucial to human-computer interaction. The
temporal regions that convey different emotions are scattered locally across
different parts of the speech. Moreover, the temporal scales of important
information may vary over a large range within and across speech segments.
Although transformer-based models have made progress in this field, existing
models cannot precisely locate important regions at different temporal scales. To
address this issue, we propose Dynamic Window transFormer (DWFormer), a new
architecture that leverages temporal importance by dynamically splitting
samples into windows. A self-attention mechanism is applied within each window
to capture temporally important information locally in a fine-grained way.
Cross-window information interaction is also taken into account for global
communication. DWFormer is evaluated on both the IEMOCAP and the MELD datasets.
Experimental results show that the proposed model achieves better performance
than the previous state-of-the-art methods.
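A minimal sketch of the dynamic-window idea described in the abstract, assuming PyTorch. The module name DynamicWindowBlock and the mean-threshold boundary rule are illustrative assumptions, not the authors' released implementation: per-frame importance scores determine variable-length windows, self-attention runs within each window, and attention over window summaries provides cross-window communication.

```python
# Hypothetical sketch of dynamic-window self-attention for SER.
# The splitting rule (importance crossing its mean) is a simplification.
import torch
import torch.nn as nn


class DynamicWindowBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.importance = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level speech features.
        b, t, d = x.shape
        scores = self.importance(x).squeeze(-1)          # (b, t) per-frame importance
        out = torch.zeros_like(x)
        for i in range(b):
            # Simplified splitting rule: start a new window wherever the
            # importance score crosses its sequence mean.
            above = scores[i] > scores[i].mean()
            cuts = (torch.nonzero(above[1:] != above[:-1]).squeeze(-1) + 1).tolist()
            boundaries = [0] + cuts + [t]
            summaries = []
            for s, e in zip(boundaries[:-1], boundaries[1:]):
                win = x[i:i + 1, s:e]                     # one variable-length window
                attn, _ = self.local_attn(win, win, win)  # fine-grained local attention
                out[i, s:e] = attn[0]
                summaries.append(attn.mean(dim=1))        # (1, dim) window summary
            # Cross-window interaction: attend across window summaries.
            summ = torch.stack(summaries, dim=1)          # (1, n_windows, dim)
            g, _ = self.global_attn(summ, summ, summ)
            # Broadcast the global context back to the frames of each window.
            for k, (s, e) in enumerate(zip(boundaries[:-1], boundaries[1:])):
                out[i, s:e] = out[i, s:e] + g[0, k]
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 120, 256)              # e.g., 120 frames of 256-dim features
    print(DynamicWindowBlock()(feats).shape)      # torch.Size([2, 120, 256])
```

Unlike transformers with uniform chunking, the window boundaries here depend on the input itself, which is the property the abstract emphasizes; only the rule used to derive boundaries from temporal importance is a stand-in.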
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
The AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z) - A Transformer-Based Model With Self-Distillation for Multimodal Emotion
Recognition in Conversations [15.77747948751497]
We propose a transformer-based model with self-distillation (SDT) for the task.
The proposed model captures intra- and inter-modal interactions by utilizing intra- and inter-modal transformers.
We introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality.
arXiv Detail & Related papers (2023-10-31T14:33:30Z) - Disentangled Variational Autoencoder for Emotion Recognition in
Conversations [14.92924920489251]
We propose a VAD-disentangled Variational AutoEncoder (VAD-VAE) for Emotion Recognition in Conversations (ERC).
VAD-VAE disentangles three affect representations, Valence-Arousal-Dominance (VAD), from the latent space.
Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets.
arXiv Detail & Related papers (2023-05-23T13:50:06Z) - Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach
for Speech Emotion Recognition [23.13759265661777]
Speech emotion recognition (SER) plays a vital role in improving interactions between humans and machines.
We introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net).
arXiv Detail & Related papers (2022-11-14T13:35:01Z) - MSA-GCN: Multiscale Adaptive Graph Convolution Network for Gait Emotion
Recognition [6.108523790270448]
We present a novel Multi Scale Adaptive Graph Convolution Network (MSA-GCN) to recognize emotions.
In our model, an adaptive selective spatial-temporal convolution is designed to select the convolution kernel dynamically and obtain the soft spatial-temporal features of different emotions.
Compared with previous state-of-the-art methods, the proposed method achieves the best performance on two public datasets.
arXiv Detail & Related papers (2022-09-19T13:07:16Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - VIRT: Improving Representation-based Models for Text Matching through
Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors of interaction-based models.
arXiv Detail & Related papers (2021-12-08T09:49:28Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Multi-Window Data Augmentation Approach for Speech Emotion Recognition [58.987211083697645]
We present a Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition.
MWA-SER is a unimodal approach that focuses on two key concepts: designing the speech augmentation method and building the deep learning model.
We show that our augmentation method, combined with a deep learning model, improves speech emotion recognition performance.
arXiv Detail & Related papers (2020-10-19T22:15:03Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.