A vector quantized masked autoencoder for speech emotion recognition
- URL: http://arxiv.org/abs/2304.11117v1
- Date: Fri, 21 Apr 2023 16:37:57 GMT
- Title: A vector quantized masked autoencoder for speech emotion recognition
- Authors: Samir Sadok, Simon Leglaive, Renaud Séguier
- Abstract summary: We propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals.
Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset, outperforms an MAE working on the raw spectrogram representation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen remarkable progress in speech emotion recognition
(SER), thanks to advances in deep learning techniques. However, the limited
availability of labeled data remains a significant challenge in the field.
Self-supervised learning has recently emerged as a promising solution to
address this challenge. In this paper, we propose the vector quantized masked
autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned
to recognize emotions from speech signals. The VQ-MAE-S model is based on a
masked autoencoder (MAE) that operates in the discrete latent space of a
vector-quantized variational autoencoder. Experimental results show that the
proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on
emotional speech data, outperforms an MAE working on the raw spectrogram
representation and other state-of-the-art methods in SER.
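The core mechanism, masked prediction over the discrete token indices of a VQ-VAE rather than over raw spectrogram patches, can be sketched as follows. This is a minimal illustration under placeholder assumptions (codebook size, mask ratio, model layout), not the authors' implementation; for SER fine-tuning, the reconstruction head would be replaced by an emotion classifier on top of the encoder.

```python
import torch
import torch.nn as nn

# Placeholder sizes; the paper's actual configuration may differ.
CODEBOOK_SIZE = 512   # number of VQ-VAE codebook entries
EMBED_DIM = 256       # transformer width
SEQ_LEN = 100         # discrete tokens per utterance
MASK_RATIO = 0.5      # fraction of tokens hidden during pre-training

class MaskedTokenMAE(nn.Module):
    """Masked autoencoding over VQ-VAE token indices (sketch)."""
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)
        self.mask_embed = nn.Parameter(torch.zeros(EMBED_DIM))
        self.pos_embed = nn.Parameter(torch.zeros(SEQ_LEN, EMBED_DIM))
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)  # predicts token ids

    def forward(self, tokens):  # tokens: (B, SEQ_LEN), int64 codebook indices
        mask = torch.rand(tokens.shape, device=tokens.device) < MASK_RATIO
        x = self.token_embed(tokens)
        # Replace masked positions with a learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(x), x)
        logits = self.head(self.encoder(x + self.pos_embed))
        # Reconstruction loss only on the masked positions, as in MAE.
        return nn.functional.cross_entropy(logits[mask], tokens[mask])

tokens = torch.randint(0, CODEBOOK_SIZE, (8, SEQ_LEN))  # stand-in VQ-VAE indices
print(MaskedTokenMAE()(tokens))
```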
Related papers
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
- Boosting Continuous Emotion Recognition with Self-Pretraining using Masked Autoencoders, Temporal Convolutional Networks, and Transformers [3.951847822557829]
We tackle the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge.
Our study advocates a novel approach aimed at refining continuous emotion recognition.
We achieve this by pre-training with Masked Autoencoders (MAE) on facial datasets, followed by fine-tuning on the aff-wild2 dataset annotated with expression (Expr) labels.
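The pre-train-then-fine-tune recipe this entry describes typically looks like the sketch below: reuse the pretrained encoder and train a fresh task head, often with a smaller learning rate on the pretrained weights. The encoder stand-in, layer sizes, and label count are hypothetical, not the challenge entry's actual code.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 8  # e.g., expression labels; placeholder value

# `pretrained_encoder` stands in for the MAE encoder after pre-training.
pretrained_encoder = nn.Sequential(nn.Linear(128, 256), nn.GELU())

model = nn.Sequential(
    pretrained_encoder,
    nn.Linear(256, NUM_CLASSES),   # new task head, trained from scratch
)

# Common recipe: lower learning rate on pretrained weights than on the head.
optimizer = torch.optim.AdamW([
    {"params": pretrained_encoder.parameters(), "lr": 1e-5},
    {"params": model[-1].parameters(), "lr": 1e-3},
])

features = torch.randn(4, 128)          # stand-in pooled input features
labels = torch.randint(0, NUM_CLASSES, (4,))
loss = nn.functional.cross_entropy(model(features), labels)
loss.backward()
optimizer.step()
```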
arXiv Detail & Related papers (2024-03-18T03:28:01Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The VQ-VAE is trained on clean speech only; hence, large quantization errors can be expected when the input speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
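The idea of using quantization error as a quality proxy can be sketched as below. The distance measure and aggregation here are assumptions; the actual VQScore formulation may differ.

```python
import torch

def vq_quantization_error(z, codebook):
    """Mean distance between encoder outputs and their nearest codewords.

    z:        (T, D) continuous encoder outputs for one utterance
    codebook: (K, D) VQ-VAE codebook trained on clean speech only
    Larger error suggests the input lies off the clean-speech manifold.
    """
    dists = torch.cdist(z, codebook)          # (T, K) pairwise distances
    return dists.min(dim=1).values.mean().item()

codebook = torch.randn(512, 64)  # stand-in for a trained codebook
clean_like = codebook[torch.randint(0, 512, (100,))] + 0.01 * torch.randn(100, 64)
noisy_like = torch.randn(100, 64) * 3.0
print(vq_quantization_error(clean_like, codebook))  # small error
print(vq_quantization_error(noisy_like, codebook))  # larger error
```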
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results demonstrate the potential of CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis.
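As a rough illustration of masking-based contrastive pre-training, the sketch below pairs a generic random mask with an InfoNCE loss; the paper's cochlear cepstrum-based masking is domain-specific and not reproduced here.

```python
import torch
import torch.nn.functional as F

def random_mask(spec, mask_frac=0.2):
    """Zero out a random fraction of feature bins (generic masking)."""
    keep = torch.rand_like(spec) > mask_frac
    return spec * keep

def contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE between two masked views of the same batch (sketch)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

spec = torch.randn(16, 40, 100)                # batch of cepstral features
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(40 * 100, 128))
loss = contrastive_loss(encoder(random_mask(spec)), encoder(random_mask(spec)))
print(loss)
```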
arXiv Detail & Related papers (2024-02-10T11:13:13Z)
- A vector quantized masked autoencoder for audiovisual speech emotion recognition [5.8641712963450825]
The paper proposes VQ-MAE-AV, a vector quantized masked autoencoder (MAE) designed for self-supervised audiovisual speech representation learning.
A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence.
Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods.
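One direction of such a cross-attention fusion can be sketched as a minimal module (not the VQ-MAE-AV architecture; in practice the symmetric visual-to-audio direction and self-attention layers would be added):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Audio tokens attend to visual tokens (one direction of a
    cross-attention fusion; a sketch, not the paper's model)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # Queries come from audio; keys/values from the visual stream.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)   # residual connection

audio = torch.randn(2, 50, 256)    # (batch, audio tokens, dim)
visual = torch.randn(2, 30, 256)   # (batch, visual tokens, dim)
print(CrossModalFusion()(audio, visual).shape)  # torch.Size([2, 50, 256])
```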
arXiv Detail & Related papers (2023-05-05T14:19:46Z)
- A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond [64.85076239939336]
Self-supervised learning (SSL) in vision may follow a trajectory similar to that in NLP.
Generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP.
The success of masked image modeling has revived the masked autoencoder.
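The masked-autoencoder recipe the survey covers hinges on an asymmetric design: encode only the small visible subset of patches, then reconstruct the rest with a lightweight decoder. A minimal sketch of the visible-patch selection, with placeholder sizes:

```python
import torch
import torch.nn as nn

# Minimal masked-autoencoder forward pass on image patches (sketch).
B, N, D = 4, 196, 128          # batch, patches per image, embedding dim
MASK_RATIO = 0.75              # MAE masks most patches

patches = torch.randn(B, N, D)
ids = torch.rand(B, N).argsort(dim=1)          # random patch permutation
keep = ids[:, : int(N * (1 - MASK_RATIO))]     # indices of visible patches
visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))

# The encoder sees only the visible 25%, which keeps pre-training cheap;
# a lightweight decoder would later reconstruct the masked patches.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
latent = encoder(visible)
print(latent.shape)  # torch.Size([4, 49, 128])
```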
arXiv Detail & Related papers (2022-07-30T09:59:28Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
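Late fusion of this kind typically concatenates utterance-level embeddings from independently pretrained backbones before a joint classifier. A sketch with stand-in encoders (the real ones would be a speaker-recognition network and a BERT-based text model):

```python
import torch
import torch.nn as nn

# Late fusion of pretrained speech and text encoders (sketch).
speech_encoder = nn.Linear(512, 128)   # stand-in for a speaker-recognition net
text_encoder = nn.Linear(768, 128)     # stand-in for a BERT-style encoder
classifier = nn.Linear(128 + 128, 4)   # e.g., 4 emotion classes

speech_emb = speech_encoder(torch.randn(8, 512))
text_emb = text_encoder(torch.randn(8, 768))
logits = classifier(torch.cat([speech_emb, text_emb], dim=1))
print(logits.shape)  # torch.Size([8, 4])
```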
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
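Spectrogram augmentation of this sort is commonly implemented as SpecAugment-style time and frequency masking; the sketch below shows one common recipe, with the paper's exact augmentation policy being an assumption here.

```python
import torch

def spec_augment(spec, max_f=8, max_t=20):
    """SpecAugment-style masking (sketch): zero out a random band of
    mel bins and a random span of frames."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    f0 = torch.randint(0, n_mels - max_f, (1,)).item()
    f = torch.randint(1, max_f + 1, (1,)).item()
    spec[f0 : f0 + f, :] = 0.0                 # frequency mask
    t0 = torch.randint(0, n_frames - max_t, (1,)).item()
    t = torch.randint(1, max_t + 1, (1,)).item()
    spec[:, t0 : t0 + t] = 0.0                 # time mask
    return spec

mel = torch.randn(80, 300)            # (mel bins, frames)
print(spec_augment(mel).eq(0).sum())  # some bins are now masked out
```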
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
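A sketch of the non-autoregressive idea: a position-dependent summarizer maps variable-length acoustic features to a fixed number of token slots so all output tokens can be predicted in parallel. The layer sizes and the single-attention PDS are simplifying assumptions, not the LASO implementation.

```python
import torch
import torch.nn as nn

DIM, MAX_TOKENS, VOCAB = 256, 40, 5000   # placeholder sizes

class PDS(nn.Module):
    """Position-dependent summarizer: one learned query per output slot."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(MAX_TOKENS, DIM))
        self.attn = nn.MultiheadAttention(DIM, 4, batch_first=True)

    def forward(self, acoustic):             # acoustic: (B, T, DIM), any T
        q = self.queries.expand(acoustic.size(0), -1, -1)
        summary, _ = self.attn(q, acoustic, acoustic)
        return summary                        # (B, MAX_TOKENS, DIM)

head = nn.Linear(DIM, VOCAB)
acoustic = torch.randn(2, 173, DIM)          # variable-length encoder output
logits = head(PDS()(acoustic))               # all positions at once
print(logits.shape)  # torch.Size([2, 40, 5000])
```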
arXiv Detail & Related papers (2021-02-15T15:18:59Z)