Speech Emotion Recognition using Self-Supervised Features
- URL: http://arxiv.org/abs/2202.03896v1
- Date: Mon, 7 Feb 2022 00:50:07 GMT
- Title: Speech Emotion Recognition using Self-Supervised Features
- Authors: Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno
and Hagai Aronowitz
- Abstract summary: We introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm.
Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed.
The proposed monomodal speech-only system not only achieves SOTA results, but also highlights the potential of powerful, well fine-tuned self-supervised acoustic features.
- Score: 14.954994969217998
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised pre-trained features have consistently delivered state-of-the-art
results in the field of natural language processing (NLP); however, their
merits in the field of speech emotion recognition (SER) still need further
investigation. In this paper, we introduce a modular End-to-End (E2E) SER
system based on an Upstream + Downstream architecture paradigm, which allows
easy use/integration of a large variety of self-supervised features. Several
SER experiments for predicting categorical emotion classes from the IEMOCAP
dataset are performed. These experiments investigate interactions among
fine-tuning of self-supervised feature models, aggregation of frame-level
features into utterance-level features and back-end classification networks.
The proposed monomodal speech-only system not only achieves SOTA results, but
also highlights the potential of powerful, well fine-tuned self-supervised
acoustic features to reach results similar to those achieved by SOTA
multimodal systems using both speech and text modalities.
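To make the Upstream + Downstream paradigm concrete, below is a minimal sketch in Python/PyTorch. It assumes a wav2vec 2.0 upstream from Hugging Face transformers, simple mean pooling for frame-to-utterance aggregation, and a small MLP back-end classifier; the specific checkpoint, pooling choice, and layer sizes are illustrative assumptions, not the paper's exact configuration.
```python
# Hedged sketch of an Upstream + Downstream SER pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor


class UpstreamDownstreamSER(nn.Module):
    """Upstream self-supervised encoder + utterance-level pooling + downstream classifier."""

    def __init__(self, upstream_name="facebook/wav2vec2-base",
                 num_classes=4, freeze_upstream=True):
        super().__init__()
        # Upstream: pre-trained self-supervised acoustic feature model.
        self.upstream = Wav2Vec2Model.from_pretrained(upstream_name)
        if freeze_upstream:  # the paper also studies fine-tuning the upstream model
            for p in self.upstream.parameters():
                p.requires_grad = False
        hidden = self.upstream.config.hidden_size
        # Downstream: small back-end network over pooled utterance features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_values, attention_mask=None):
        # Frame-level features from the upstream model.
        frames = self.upstream(input_values, attention_mask=attention_mask).last_hidden_state
        # Aggregate frame-level features into one utterance-level vector
        # (mean pooling is one simple option; the paper investigates aggregation choices).
        utterance = frames.mean(dim=1)
        return self.classifier(utterance)


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    waveform = torch.randn(16000)  # 1 s of 16 kHz audio as a stand-in utterance
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    model = UpstreamDownstreamSER(num_classes=4)  # e.g., 4 categorical IEMOCAP emotions
    logits = model(inputs.input_values)           # shape: (1, 4)
```
Because the upstream and downstream modules are separate, any other self-supervised feature model can be swapped in behind the same pooling and classification interface, which is the modularity the abstract emphasizes.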
Related papers
- The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities [0.28675177318965045]
This paper focuses on vowel phoneme classification and speaker recognition for the Automatic Speech Recognition (ASR) domain.
For our case study, the ASR model runs on a proprietary sensing and lighting system, used to monitor acoustic and air pollution on urban streets.
We formalize combinations of pseudo-Neural Architecture Search and Hyper-Parameters Tuning experiments, using an informed grid-search methodology, to achieve classification accuracy comparable to today's most complex architectures.
arXiv Detail & Related papers (2024-10-05T09:47:54Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - VILAS: Exploring the Effects of Vision and Language Context in Automatic
Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z) - Versatile audio-visual learning for emotion recognition [28.26077129002198]
This study proposes a versatile audio-visual learning framework for handling unimodal and multimodal systems.
We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task.
Notably, VAVL attains a new state-of-the-art performance in the emotional prediction task on the MSP-IMPROV corpus.
arXiv Detail & Related papers (2023-05-12T03:13:37Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve
Multimodal Speech Emotion Recognition [9.099532309489996]
We show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results.
We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones.
arXiv Detail & Related papers (2020-08-15T08:54:48Z) - A Dependency Syntactic Knowledge Augmented Interactive Architecture for
End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-04T14:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.