A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion
Recognition, Speaker Verification and Spoken Language Understanding
- URL: http://arxiv.org/abs/2111.02735v1
- Date: Thu, 4 Nov 2021 10:39:06 GMT
- Title: A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion
Recognition, Speaker Verification and Spoken Language Understanding
- Authors: Yingzhi Wang, Abdelmoumene Boumadane and Abdelwahab Heba
- Abstract summary: We explore partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks.
With simple downstream frameworks, the best scores reach 79.58% weighted accuracy for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, 87.51% accuracy for Intent Classification and 75.32% F1 for Slot Filling on SLURP.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised speech representations such as wav2vec 2.0 and HuBERT are
driving rapid progress in Automatic Speech Recognition (ASR). However, it has
not yet been fully established that self-supervised models improve performance
on tasks other than ASR. In this work, we explore partial fine-tuning and
entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models on three
non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and
Spoken Language Understanding. We also compare pre-trained models with and
without ASR fine-tuning. With simple downstream frameworks, the best scores
reach 79.58% weighted accuracy for Speech Emotion Recognition on IEMOCAP,
2.36% equal error rate for Speaker Verification on VoxCeleb1, 87.51% accuracy
for Intent Classification and 75.32% F1 for Slot Filling on SLURP, setting a
new state of the art on all three benchmarks and showing that fine-tuned
wav2vec 2.0 and HuBERT models can better learn prosodic, voiceprint and
semantic representations.
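The difference between the two regimes is simply which pre-trained parameters receive gradients: partial fine-tuning (here taken as freezing the convolutional feature extractor and updating only the transformer layers) versus entire fine-tuning (everything trainable). A minimal Python sketch with an average-pooling plus linear downstream head; the head design and checkpoint name are illustrative assumptions, not the authors' released code:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class DownstreamClassifier(nn.Module):
    def __init__(self, n_classes: int, partial: bool = True):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if partial:
            # Partial fine-tuning: freeze the CNN feature extractor,
            # leave only the transformer layers trainable.
            for p in self.encoder.feature_extractor.parameters():
                p.requires_grad = False
        # Entire fine-tuning: leave every encoder parameter trainable.
        self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)   # average pooling over time
        return self.head(pooled)      # (B, n_classes)

# Example: a 4-class emotion head (IEMOCAP-style) on 1 s of 16 kHz audio.
model = DownstreamClassifier(n_classes=4, partial=True)
logits = model(torch.randn(2, 16000))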
Related papers
- LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks [19.94790551312789]
A cost-effective self-supervised fine-tuning (SSFT) method named "LASER: Learning by Aligning Self-supervised Representations" is presented.
Experiments are conducted with HuBERT and WavLM models, evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR).
Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with only 3 hours of fine-tuning on a single GPU.
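The summary does not spell out the alignment objective, so the following is only a hypothetical sketch of "learning by aligning self-supervised representations": minimizing a frame-wise cosine distance between representations of two views of the same utterance. The function and loss choice are assumptions, not LASER's published recipe.

import torch
import torch.nn.functional as F

def alignment_loss(reps_a: torch.Tensor, reps_b: torch.Tensor) -> torch.Tensor:
    # reps_a, reps_b: (B, T, H) encoder outputs for two views of the
    # same utterance. Hypothetical stand-in for LASER's alignment loss.
    return (1.0 - F.cosine_similarity(reps_a, reps_b, dim=-1)).mean()

loss = alignment_loss(torch.randn(2, 50, 768), torch.randn(2, 50, 768))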
arXiv Detail & Related papers (2024-06-13T14:17:47Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
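Model reprogramming generally keeps the pre-trained network frozen and trains only a small input-side module. The sketch below shows that pattern; the additive two-layer convolutional enhancer is a generic illustration, not the paper's actual auxiliary architectures.

import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    # Learnable input transformation placed in front of a frozen model.
    def __init__(self, frozen_model: nn.Module, feat_dim: int):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False  # backbone stays fixed
        # Small trainable module that enhances the input features.
        self.delta = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim, T). Add a learned perturbation, then run
        # the unchanged pre-trained model on the modified input.
        return self.model(feats + self.delta(feats))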
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- MelHuBERT: A simplified HuBERT on Mel spectrograms [55.608981341747246]
We revisit the training of HuBERT, a highly successful self-supervised model.
We improve and simplify several key components, including the loss function, input representation, and training in multiple stages.
Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition.
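A rough sketch of the two headline simplifications, Mel-spectrogram input and a plain masked-prediction cross-entropy loss, follows. The cluster count, masking rate, and tiny encoder are illustrative assumptions, not MelHuBERT's exact configuration.

import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
feats = mel(torch.randn(1, 16000)).transpose(1, 2)   # (B, T, 80) Mel frames

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=2,
)
predictor = nn.Linear(80, 100)  # 100 pseudo-label clusters (assumed)

mask = torch.rand(feats.shape[:2]) < 0.3             # mask ~30% of frames
masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # simple zero masking
logits = predictor(encoder(masked))
targets = torch.randint(0, 100, feats.shape[:2])     # offline cluster labels
loss = nn.functional.cross_entropy(logits[mask], targets[mask])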
arXiv Detail & Related papers (2022-11-17T23:38:29Z)
- Do We Still Need Automatic Speech Recognition for Spoken Language Understanding? [14.575551366682872]
We show that learned speech features are superior to ASR transcripts on three classification tasks.
We highlight the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to better performance.
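In practice, "learned speech features" means classifying directly from the SSL encoder output instead of from ASR text, which is why out-of-vocabulary words cannot hurt the classifier. A minimal sketch of that route; the pooling choice, head size, and checkpoint are assumptions:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
intent_head = nn.Linear(encoder.config.hidden_size, 18)  # placeholder class count

wav = torch.randn(1, 16000)                  # 1 s of 16 kHz audio
with torch.no_grad():
    feats = encoder(wav).last_hidden_state   # (1, T, 768) speech features
logits = intent_head(feats.mean(dim=1))      # classify pooled features, no ASR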
arXiv Detail & Related papers (2021-11-29T15:13:36Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
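A schematic of the switching idea: each branch must solve the contrastive task not only against its own quantized targets but also against the other view's, which pushes the representations toward noise invariance. The simplified frame-level loss below is an illustration, not the exact wav2vec 2.0 objective:

import torch
import torch.nn.functional as F

def contrastive_loss(context: torch.Tensor, targets: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    # Simplified contrastive task: each context vector (T, H) must pick
    # out its own (possibly switched) quantized target among all frames.
    sim = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    labels = torch.arange(context.size(0))
    return F.cross_entropy(sim / temperature, labels)

c_clean, c_noisy = torch.randn(50, 256), torch.randn(50, 256)  # contexts
q_clean, q_noisy = torch.randn(50, 256), torch.randn(50, 256)  # quantized
# Standard task plus the switched task: clean context vs. noisy targets
# and noisy context vs. clean targets.
loss = (contrastive_loss(c_clean, q_clean) + contrastive_loss(c_noisy, q_noisy)
        + contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean))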
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset [0.0]
This paper introduces a deep learning constructed emotional recognition model for Arabic speech dialogues.
The developed model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT.
The experimental results of our model surpass previously reported outcomes.
arXiv Detail & Related papers (2021-10-09T00:58:12Z)
- Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
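In this kind of general-purpose setup the usual pattern is one shared pre-trained encoder with a lightweight head per task. The sketch below assumes that pattern; the task set, head sizes, and pooling are invented for illustration:

import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    # One shared pre-trained encoder, one small head per task. The
    # encoder is assumed to map input features to (B, T, hidden).
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict({
            "keyword": nn.Linear(hidden, 12),   # keyword-spotting classes
            "speaker": nn.Linear(hidden, 256),  # speaker embedding
            "emotion": nn.Linear(hidden, 4),    # emotion classes
        })

    def forward(self, feats: torch.Tensor, task: str) -> torch.Tensor:
        pooled = self.encoder(feats).mean(dim=1)  # (B, hidden) utterance vector
        return self.heads[task](pooled)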
arXiv Detail & Related papers (2021-10-03T19:28:57Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
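Both ingredients are easy to sketch: an offline k-means pass over acoustic features yields frame-level pseudo-labels, and the BERT-like cross-entropy is evaluated only at masked positions. The 39-dimensional MFCCs and 100 clusters match the paper's first training iteration; everything else below is simplified:

import torch
import numpy as np
from sklearn.cluster import KMeans

# 1) Offline clustering: pseudo-labels from MFCC frames (first iteration).
mfcc_frames = np.random.randn(10000, 39)          # (n_frames, mfcc_dim)
kmeans = KMeans(n_clusters=100, n_init=10).fit(mfcc_frames)
labels = torch.from_numpy(kmeans.labels_[:500]).long().reshape(1, 500)

# 2) Masked prediction: compute the loss over masked frames only.
logits = torch.randn(1, 500, 100, requires_grad=True)  # encoder output (toy)
mask = torch.rand(1, 500) < 0.08                       # masked positions
loss = torch.nn.functional.cross_entropy(logits[mask], labels[mask])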
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% under the 1-second condition and an EER of 3.47% under the full-length condition of the AP17-OLR dataset.
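Since both results are equal error rates, it may help to recall how EER is computed from trial scores: it is the operating point where the false-acceptance and false-rejection rates coincide. A small self-contained computation on synthetic scores:

import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    # scores: similarity scores (higher = more likely same class);
    # labels: 1 for target trials, 0 for impostor trials.
    thresholds = np.sort(scores)
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))  # false rejection rate
    i = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return float((fars[i] + frrs[i]) / 2)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)]).astype(int)
print(f"EER: {equal_error_rate(scores, labels):.2%}")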
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)