Speech Emotion Recognition using Self-Supervised Features
- URL: http://arxiv.org/abs/2202.03896v1
- Date: Mon, 7 Feb 2022 00:50:07 GMT
- Title: Speech Emotion Recognition using Self-Supervised Features
- Authors: Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno
and Hagai Aronowitz
- Abstract summary: We introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm.
Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed.
The proposed monomodal, speech-only system not only achieves SOTA results, but also sheds light on the possibility of powerful and well fine-tuned self-supervised acoustic features.
- Score: 14.954994969217998
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised pre-trained features have consistently delivered state-of-the-art
results in the field of natural language processing (NLP); however, their
merits in the field of speech emotion recognition (SER) still need further
investigation. In this paper we introduce a modular End-to-End (E2E) SER
system based on an Upstream + Downstream architecture paradigm, which allows
easy use/integration of a large variety of self-supervised features. Several
SER experiments for predicting categorical emotion classes from the IEMOCAP
dataset are performed. These experiments investigate interactions among
fine-tuning of self-supervised feature models, aggregation of frame-level
features into utterance-level features and back-end classification networks.
The proposed monomodal, speech-only system not only achieves SOTA results,
but also sheds light on the possibility of powerful and well fine-tuned
self-supervised acoustic features that reach results similar to the results
achieved by SOTA multimodal systems using both Speech and Text modalities.
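The Upstream + Downstream flow described in the abstract (frame-level features, aggregation into an utterance-level feature, back-end classification) can be sketched as follows. This is only a toy illustration under stated assumptions, not the authors' system: the `upstream` function is a hypothetical stand-in that computes simple per-frame statistics instead of a real self-supervised model, mean pooling is just one possible aggregation, the linear softmax classifier uses random untrained weights, and the four emotion labels are assumed for the example.

```python
import math
import random


def upstream(waveform, frame_size=4):
    """Hypothetical stand-in for a self-supervised feature model.

    Returns one small feature vector (mean, energy, peak) per frame.
    """
    feats = []
    for start in range(0, len(waveform) - frame_size + 1, frame_size):
        frame = waveform[start:start + frame_size]
        mean = sum(frame) / len(frame)
        energy = sum(x * x for x in frame) / len(frame)
        peak = max(abs(x) for x in frame)
        feats.append([mean, energy, peak])
    return feats


def mean_pool(frame_feats):
    """Aggregate frame-level features into one utterance-level vector."""
    n = len(frame_feats)
    return [sum(column) / n for column in zip(*frame_feats)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def downstream(utt_feat, weights, bias):
    """Back-end classifier: a single linear layer followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, utt_feat)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Assumed label set for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

rng = random.Random(0)
waveform = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # fake audio samples
weights = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)  # untrained, random classifier

frame_feats = upstream(waveform)        # frame-level features
utt_feat = mean_pool(frame_feats)       # utterance-level feature
probs = downstream(utt_feat, weights, bias)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Keeping the three stages as separate functions mirrors the modularity the abstract emphasizes: a different feature extractor or pooling strategy can be swapped in without touching the classifier.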
Related papers
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
- Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [17.407912171579852]
Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP).
We introduce a modular End-to-End (E2E) SLU architecture, based on transformer networks, which allows the use of self-supervised pre-trained acoustic features.
arXiv Detail & Related papers (2020-11-16T19:30:52Z)
- Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition [9.099532309489996]
We show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results.
We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones.
arXiv Detail & Related papers (2020-08-15T08:54:48Z)
- A Dependency Syntactic Knowledge Augmented Interactive Architecture for End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-04T14:59:32Z)