Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based
on BAVED Dataset
- URL: http://arxiv.org/abs/2110.04425v1
- Date: Sat, 9 Oct 2021 00:58:12 GMT
- Title: Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based
on BAVED Dataset
- Authors: Omar Mohamed and Salah A. Aly
- Abstract summary: This paper introduces a deep-learning-based emotion
recognition model for Arabic speech dialogues.
The developed model employs state-of-the-art audio representations, including
wav2vec2.0 and HuBERT.
The experimental results of our model surpass previously reported outcomes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there have been tremendous research outcomes in the
fields of speech recognition and natural language processing. This is due to
well-developed multi-layer deep learning paradigms such as wav2vec2.0,
Wav2vecU, WavBERT, and HuBERT that provide better representation learning and
capture more information. Such paradigms are pretrained on large amounts of
unlabeled data and then fine-tuned on a small labeled dataset for specific
tasks. This paper introduces a deep-learning-based emotion recognition model
for Arabic speech dialogues. The developed model employs state-of-the-art
audio representations, including wav2vec2.0 and HuBERT. The experimental
results of our model surpass previously reported outcomes.
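As a concrete illustration of the pipeline the abstract describes, here is a
minimal sketch that puts a small classification head on a pretrained
wav2vec2.0 encoder. The checkpoint name, the mean pooling, and the three-class
setup (BAVED's emotion intensity levels) are illustrative assumptions, not the
authors' exact configuration; a HuBERT encoder can be swapped in via
transformers' HubertModel.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    class EmotionClassifier(nn.Module):
        def __init__(self, name="facebook/wav2vec2-base", num_classes=3):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(name)
            self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

        def forward(self, input_values):
            hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
            return self.head(hidden.mean(dim=1))  # pool over time -> logits

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = EmotionClassifier()
    wave = torch.randn(16000)  # stand-in for one second of 16 kHz audio
    inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
    logits = model(inputs.input_values)  # train with nn.CrossEntropyLoss
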
Related papers
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen
Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive amounts of text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- Self-supervised Learning with Random-projection Quantizer for Speech
Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of
discrete labels produced by a random-projection quantizer (see the quantizer
sketch after this list).
It achieves word error rates similar to those of previous self-supervised
learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT
Based on the Quran Reciters Dataset [0.0]
We develop a deep learning model for Arabic speaker identification using the
Wav2Vec2.0 and HuBERT audio representation learning tools.
The experiments show that an arbitrary wave signal from a given speaker can be
identified with accuracies of 98% and 97.1%.
arXiv Detail & Related papers (2021-11-11T17:44:50Z)
- A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion
Recognition, Speaker Verification and Spoken Language Understanding [0.9023847175654603]
We explore partial fine-tuning and entire fine-tuning on wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks.
With simple down-stream frameworks, the best scores reach 79.58% weighted accuracy for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, 87.51% accuracy for Intent Classification and 75.32% F1 for Slot Filling on SLURP.
arXiv Detail & Related papers (2021-11-04T10:39:06Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings [16.829474982595837]
We propose a transfer learning method for speech emotion recognition.
We combine the output of several layers from the pre-trained model using
trainable weights which are learned jointly with the downstream model (see the
layer-weighting sketch after this list).
We evaluate our proposed approaches on two standard emotion databases,
IEMOCAP and RAVDESS, showing superior performance compared to results in the
literature.
arXiv Detail & Related papers (2021-04-08T04:31:58Z)
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource
Languages [16.001329145018687]
In the speech domain, wav2vec2.0 has begun to show its powerful representation
ability and the feasibility of ultra-low-resource speech recognition on the
Librispeech corpus.
However, wav2vec2.0 has not been examined in real spoken scenarios or on
languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Exploring wav2vec 2.0 on speaker verification and language
identification [9.047596226273495]
Wav2vec 2.0 is a self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on 1 second condition and an EER of 3.47% on full-length condition of the AP17-OLR dataset.
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a
contrastive task defined over a quantization of the latent representations,
which are learned jointly (see the contrastive-loss sketch after this list).
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
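Code sketches
The quantizer sketch referenced above illustrates the random-projection
quantizer idea: a frozen random projection and a frozen random codebook turn
each speech frame into a discrete label for masked prediction. All sizes are
illustrative assumptions, and details such as vector normalization are
omitted.

    import torch

    torch.manual_seed(0)
    feat_dim, code_dim, codebook_size = 80, 16, 512
    projection = torch.randn(feat_dim, code_dim)     # fixed at init, not trained
    codebook = torch.randn(codebook_size, code_dim)  # fixed at init, not trained

    def quantize(frames: torch.Tensor) -> torch.Tensor:
        """Map (T, feat_dim) speech frames to (T,) discrete target labels."""
        projected = frames @ projection           # (T, code_dim)
        dists = torch.cdist(projected, codebook)  # (T, codebook_size)
        return dists.argmin(dim=-1)               # nearest codebook entry

    labels = quantize(torch.randn(100, feat_dim))  # targets for masked frames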
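The layer-weighting sketch referenced above shows one way to combine the
outputs of all encoder layers with softmax-normalized weights learned jointly
with the downstream model. The checkpoint name and shapes are illustrative
assumptions.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class WeightedLayerPooling(nn.Module):
        def __init__(self, name="facebook/wav2vec2-base"):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(name)
            n_layers = self.encoder.config.num_hidden_layers + 1  # + embeddings
            self.layer_weights = nn.Parameter(torch.zeros(n_layers))

        def forward(self, input_values):
            out = self.encoder(input_values, output_hidden_states=True)
            stacked = torch.stack(out.hidden_states)          # (L, B, T, H)
            w = torch.softmax(self.layer_weights, dim=0)      # one weight per layer
            return (w[:, None, None, None] * stacked).sum(0)  # (B, T, H)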
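The contrastive-loss sketch referenced above shows an InfoNCE-style objective
of the kind wav2vec 2.0 solves: for each masked time step, the context
representation must pick out the true quantized latent among sampled
distractors. Shapes and the temperature value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, quantized, distractors, temperature=0.1):
        """context, quantized: (T, H); distractors: (T, K, H)."""
        candidates = torch.cat([quantized.unsqueeze(1), distractors], dim=1)
        sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
        targets = torch.zeros(len(context), dtype=torch.long)  # true latent at 0
        return F.cross_entropy(sims / temperature, targets)

    T, H, K = 50, 256, 100
    loss = contrastive_loss(torch.randn(T, H), torch.randn(T, H),
                            torch.randn(T, K, H))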
This list is automatically generated from the titles and abstracts of the
papers on this site.
The site does not guarantee the quality of this information and is not
responsible for any consequences of its use.