A Comparative Study of Pre-trained Speech and Audio Embeddings for
Speech Emotion Recognition
- URL: http://arxiv.org/abs/2304.11472v1
- Date: Sat, 22 Apr 2023 19:56:35 GMT
- Title: A Comparative Study of Pre-trained Speech and Audio Embeddings for
Speech Emotion Recognition
- Authors: Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma
- Abstract summary: Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings derived from PTMs trained for speaker recognition, followed by wav2clip and UniSpeech-SAT.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained models (PTMs) have shown great promise in the speech and audio
domain. Embeddings leveraged from these models serve as inputs for learning
algorithms with applications in various downstream tasks. One such crucial task
is Speech Emotion Recognition (SER) which has a wide range of applications,
including dynamic analysis of customer calls, mental health assessment, and
personalized language learning. PTM embeddings have helped advance SER;
however, a comprehensive comparison of these PTM embeddings that considers
multiple facets such as embedding model architecture, data used for
pre-training, and the pre-training procedure followed is missing. A
thorough comparison of PTM embeddings will aid in the faster and more efficient
development of models and enable their deployment in real-world scenarios. In
this work, we address this research gap and perform a comparative analysis of
embeddings extracted from eight speech and audio PTMs (wav2vec 2.0, data2vec,
wavLM, UniSpeech-SAT, wav2clip, YAMNet, x-vector, ECAPA). We perform an
extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS,
SAVEE, Emo-DB) by training three algorithms (XGBoost, Random Forest, FCN) on
the derived embeddings. The results of our study indicate that the best
performance is achieved by algorithms trained on embeddings derived from PTMs
trained for speaker recognition, followed by wav2clip and UniSpeech-SAT. This
suggests that the top performance of embeddings from speaker recognition PTMs
is most likely due to these models capturing information about numerous speech
characteristics such as tone, accent, and pitch during their speaker recognition
training. Insights from this work will assist future studies in their selection
of embeddings for applications related to SER.
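
To make the study's setup concrete, the following is a minimal sketch of the kind of pipeline the abstract describes: utterance-level embeddings are extracted from a speaker-recognition PTM (here ECAPA via SpeechBrain) and a classifier (here XGBoost) is trained on them. The checkpoint name, file paths, pooling, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: ECAPA speaker embeddings as SER features + XGBoost classifier.
# Checkpoint, paths, and hyperparameters are placeholders, not the paper's setup.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier   # pre-trained ECAPA encoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load a speaker-recognition PTM (assumed SpeechBrain checkpoint).
ecapa = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_ecapa"
)

def embed(path: str) -> torch.Tensor:
    """Return a fixed-size utterance embedding for one audio file."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)
    with torch.no_grad():
        emb = ecapa.encode_batch(wav)       # shape (1, 1, 192)
    return emb.squeeze()                    # shape (192,)

# Placeholder file list and emotion labels; in practice these would come from
# one of the datasets used in the study (e.g. CREMA-D or Emo-DB).
files = ["clip_000.wav", "clip_001.wav", "clip_002.wav", "clip_003.wav"]
labels = [0, 1, 0, 1]

X = torch.stack([embed(f) for f in files]).numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Swapping the embedding model (e.g. wav2vec 2.0 or wavLM instead of ECAPA) only changes the `embed` function; the classifier stage stays the same, which is what makes a comparison across PTM embeddings like the one above straightforward to set up.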
Related papers
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research.
We propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their inherent prior capabilities on audio-language tasks.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26 WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation (a minimal embedding-extraction sketch follows this list).
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
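
As noted in the WavLM entry above, wavLM is also one of the eight PTMs compared in the main study. The following is a minimal sketch of extracting an utterance-level wavLM embedding with HuggingFace transformers, the kind of fixed-size feature such a study feeds to its downstream classifiers; the checkpoint name and mean pooling over frames are assumptions, not prescribed by either paper.

```python
# Hedged sketch: utterance-level wavLM embedding via HuggingFace transformers.
# Checkpoint name, input file, and mean pooling are illustrative assumptions.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

wav, sr = torchaudio.load("clip.wav")                              # placeholder file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)   # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state    # (1, n_frames, hidden_size)

embedding = frames.mean(dim=1).squeeze(0)         # fixed-size utterance embedding
print(embedding.shape)
```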
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.