On the use of Self-supervised Pre-trained Acoustic and Linguistic
Features for Continuous Speech Emotion Recognition
- URL: http://arxiv.org/abs/2011.09212v1
- Date: Wed, 18 Nov 2020 11:10:29 GMT
- Title: On the use of Self-supervised Pre-trained Acoustic and Linguistic
Features for Continuous Speech Emotion Recognition
- Authors: Manon Macary, Marie Tahon, Yannick Estève, Anthony Rousseau
- Abstract summary: We use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech.
To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant for the continuous SER task.
- Score: 2.294014185517203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training for feature extraction is an increasingly studied approach to
obtain better continuous representations of audio and text content. In the present
work, we use wav2vec and camemBERT as self-supervised learned models to
represent our data in order to perform continuous speech emotion recognition
(SER) on AlloSat, a large French emotional database annotated along the
satisfaction dimension, and on the state-of-the-art SEWA corpus, which covers
the valence, arousal and liking dimensions. To the authors' knowledge, this paper
presents the first study showing that the joint use of wav2vec and BERT-like
pre-trained features is highly relevant for the continuous SER task, which is
usually characterized by a small amount of labeled training data. Evaluated
with the well-known concordance correlation coefficient (CCC), our experiments
show that we can reach a CCC value of 0.825 on the AlloSat dataset, compared to
0.592 when using MFCCs in conjunction with word2vec word embeddings.
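For reference, the concordance correlation coefficient used for evaluation has a standard closed form (this is the usual definition, not quoted from the paper):

```latex
\mathrm{CCC} = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^{2} + \sigma_y^{2} + (\mu_x - \mu_y)^{2}}
```

where \rho is the Pearson correlation between predictions x and gold annotations y, and \mu, \sigma^2 denote their means and variances; a CCC of 1 indicates perfect agreement.

Below is a minimal sketch of the joint acoustic-plus-linguistic feature idea: pooled wav2vec speech features concatenated with pooled camemBERT text embeddings, scored with CCC. The paper uses the original fairseq wav2vec; wav2vec 2.0 is substituted here for convenience, and the checkpoint names, mean-pooling, and fusion-by-concatenation are illustrative assumptions rather than the authors' exact pipeline:

```python
import numpy as np
import torch
from transformers import (CamembertModel, CamembertTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# NOTE: the paper uses the original (fairseq) wav2vec; wav2vec 2.0 is a
# stand-in here, and these checkpoint names are illustrative assumptions.
acoustic_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
acoustic_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
text_tok = CamembertTokenizer.from_pretrained("camembert-base")
text_model = CamembertModel.from_pretrained("camembert-base")

def extract_features(waveform_16khz: np.ndarray, transcript: str) -> torch.Tensor:
    """Return one pooled acoustic+linguistic feature vector per utterance."""
    with torch.no_grad():
        audio_in = acoustic_fe(waveform_16khz, sampling_rate=16000,
                               return_tensors="pt")
        acoustic = acoustic_model(**audio_in).last_hidden_state   # (1, T, 768)
        text_in = text_tok(transcript, return_tensors="pt")
        linguistic = text_model(**text_in).last_hidden_state      # (1, L, 768)
    # Mean-pool each modality over time, then concatenate; the paper's exact
    # temporal alignment and fusion may differ.
    return torch.cat([acoustic.mean(dim=1), linguistic.mean(dim=1)], dim=-1)

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient (standard definition)."""
    mx, my = pred.mean(), gold.mean()
    cov = ((pred - mx) * (gold - my)).mean()
    return float(2 * cov / (pred.var() + gold.var() + (mx - my) ** 2))
```

A regression head (e.g. a recurrent network over frame-level features) would then be trained to predict the continuous satisfaction or valence/arousal traces and evaluated with ccc.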
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
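The summary above names the method but not its mechanics; the following is a generic, hypothetical sketch of agreement-based multi-view pseudo-labeling (the views, threshold, and agreement rule are assumptions, not the paper's actual method):

```python
import torch
import torch.nn.functional as F

def multi_view_pseudo_labels(view_logits: list[torch.Tensor],
                             threshold: float = 0.9):
    """Keep a pseudo-label only when every view is confident and all agree.

    view_logits: one (batch, n_classes) tensor per view, e.g. from an
    acoustic and a linguistic encoder. The threshold is illustrative.
    """
    probs = [F.softmax(logits, dim=-1) for logits in view_logits]
    preds = [p.argmax(dim=-1) for p in probs]
    confident = torch.stack([p.max(dim=-1).values >= threshold
                             for p in probs]).all(dim=0)
    agree = torch.stack([pr == preds[0] for pr in preds]).all(dim=0)
    keep = confident & agree
    return preds[0][keep], keep  # pseudo-labels and mask of retained items
```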
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New continuous sign-level annotations cover six hours of test videos and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst [4.6193503399184275]
This technical report presents our emotion recognition pipeline for the high-dimensional emotion task (A-VB High) in the ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition.
In empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline model.
arXiv Detail & Related papers (2022-09-15T22:06:42Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
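A minimal sketch of the pseudo-language induction described above: quantize self-supervised speech features with k-means and collapse consecutive repeats into compact discrete token sequences (the clustering setup and vocabulary size are illustrative assumptions, not Wav2Seq's exact recipe):

```python
from itertools import groupby

import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_language(features: list[np.ndarray], n_units: int = 500):
    """Map continuous speech features to deduplicated discrete unit strings.

    features: per-utterance arrays of shape (frames, dim) from a pretrained
    speech encoder. n_units is an illustrative vocabulary size.
    """
    km = KMeans(n_clusters=n_units, n_init=10).fit(np.vstack(features))
    pseudo_texts = []
    for feats in features:
        units = km.predict(feats)
        deduped = [u for u, _ in groupby(units)]  # collapse consecutive repeats
        pseudo_texts.append(" ".join(f"u{u}" for u in deduped))
    return pseudo_texts  # transcription targets for pseudo speech recognition
```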
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
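A generic sketch of the prompt-based reformulation described above: score emotion label words at a [MASK] slot with a masked-LM head (the prompt template, label words, and text-only checkpoint are illustrative assumptions, not MEmoBERT itself, which is multimodal):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
LABEL_WORDS = ["happy", "sad", "angry", "neutral"]  # assumed verbalizer

def classify_by_prompt(utterance: str) -> str:
    """Turn emotion classification into masked text prediction."""
    prompt = f"{utterance} The speaker feels [MASK]."
    inputs = tok(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]   # (vocab_size,)
    label_ids = [tok.convert_tokens_to_ids(w) for w in LABEL_WORDS]
    return LABEL_WORDS[int(logits[label_ids].argmax())]
```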
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we were able to reach higher accuracy than a strong baseline trained on eight times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
- Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding [18.275646344620387]
This work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning.
We push the state of the art on the Fluent Speech Commands dataset, achieving 99.7% test accuracy in the full-dataset setting and 99.5% in the 10% subset setting.
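A minimal sketch of the two-level matching objective described above: an MSE term on utterance-level representations and a softened KL term on predicted logits. The loss weighting and temperature are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(speech_repr: torch.Tensor, text_repr: torch.Tensor,
                      speech_logits: torch.Tensor, text_logits: torch.Tensor,
                      temperature: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Match a speech student to a text teacher at two levels."""
    # Utterance-level representation matching.
    repr_loss = F.mse_loss(speech_repr, text_repr)
    # Softened logit matching; scale by T^2 as usual for distillation.
    logit_loss = F.kl_div(
        F.log_softmax(speech_logits / temperature, dim=-1),
        F.softmax(text_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * repr_loss + (1 - alpha) * logit_loss
```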
arXiv Detail & Related papers (2020-10-25T12:36:05Z)
- A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.