Exploring Wav2vec 2.0 fine-tuning for improved speech emotion
recognition
- URL: http://arxiv.org/abs/2110.06309v1
- Date: Tue, 12 Oct 2021 19:55:55 GMT
- Title: Exploring Wav2vec 2.0 fine-tuning for improved speech emotion
recognition
- Authors: Li-Wei Chen and Alexander Rudnicky
- Abstract summary: wav2vec 2.0 can be used for speech emotion recognition (SER).
Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented.
We show V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset.
We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations.
- Score: 78.92428622630861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While wav2vec 2.0 has been proposed for speech recognition (ASR), it can also
be used for speech emotion recognition (SER); its performance can be
significantly improved using different fine-tuning strategies. Two baseline
methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are
first presented. We show that V-FT is able to outperform state-of-the-art
models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy,
further improves the performance on SER. We also introduce a novel fine-tuning
method termed P-TAPT, which modifies the TAPT objective to learn contextualized
emotion representations. Experiments show that P-TAPT performs better than TAPT
especially under low-resource settings. Compared to prior work in the
literature, our top-line system achieved a 7.4% absolute improvement in
unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our
code is publicly available.
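To make the vanilla fine-tuning (V-FT) baseline concrete, here is a minimal sketch that puts a mean-pooled linear classification head on top of a pre-trained wav2vec 2.0 encoder and updates all weights with a standard cross-entropy loss. The checkpoint name, four-class label set, and pooling choice are illustrative assumptions, not details taken from the paper.

```python
# Minimal V-FT sketch, assuming the HuggingFace "facebook/wav2vec2-base"
# checkpoint and a 4-class IEMOCAP-style label set (both are assumptions).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class Wav2Vec2ForSER(nn.Module):
    def __init__(self, num_emotions: int = 4,
                 checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        # Pre-trained encoder; in V-FT all of its weights are updated.
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, num_samples) raw 16 kHz waveform
        frames = self.encoder(input_values).last_hidden_state  # (B, T, H)
        return self.classifier(frames.mean(dim=1))             # mean pool over time


if __name__ == "__main__":
    model = Wav2Vec2ForSER()
    logits = model(torch.randn(2, 16000))                  # two dummy 1 s clips
    loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
    loss.backward()                                         # one supervised V-FT step
```

TAPT would additionally continue wav2vec 2.0's self-supervised objective on the emotion corpus before this supervised step, and P-TAPT modifies that objective to target contextualized emotion representations; neither is shown in the sketch.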
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
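The Fourier-prompt idea can be pictured with a small, purely illustrative snippet: learnable prompt tokens used both directly and through a frequency-domain view obtained with torch.fft.fft. The shapes, the decision to keep only the real part, and the concatenation scheme are guesses for illustration, not the VFPT authors' design.

```python
# Illustrative only: prompt embeddings used both directly (spatial) and
# after an FFT (frequency), then prepended to the backbone's patch tokens.
import torch
import torch.nn as nn


class FourierPromptedInput(nn.Module):
    def __init__(self, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, dim) embeddings from a frozen ViT
        b = patch_tokens.size(0)
        spatial = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Frequency-domain view of the same prompts; real part kept so it
        # can be concatenated with ordinary real-valued embeddings.
        frequency = torch.fft.fft(self.prompts, dim=-1).real
        frequency = frequency.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([spatial, frequency, patch_tokens], dim=1)


if __name__ == "__main__":
    out = FourierPromptedInput()(torch.randn(2, 196, 768))
    print(out.shape)  # torch.Size([2, 212, 768]) = 8 + 8 prompts + 196 patches
```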
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations [1.6008229267455227]
We propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models.
Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts SER performance by up to 10% in Unweighted Average Recall.
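A rough sketch of the two-view contrastive idea, under the assumption that each utterance yields one wav2vec 2.0 embedding and one spectral/paralinguistic feature vector, both projected into a shared space and trained with an InfoNCE-style loss; the dimensions, projection heads, and loss form are assumptions rather than the paper's exact framework.

```python
# Sketch of a two-view contrastive objective: the two views of the same
# utterance are positives, other utterances in the batch are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoViewContrastive(nn.Module):
    def __init__(self, dim_a: int = 768, dim_b: int = 128,
                 proj_dim: int = 256, temperature: float = 0.1):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, proj_dim)   # e.g. wav2vec 2.0 view
        self.proj_b = nn.Linear(dim_b, proj_dim)   # e.g. spectral view
        self.temperature = temperature

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        za = F.normalize(self.proj_a(view_a), dim=-1)
        zb = F.normalize(self.proj_b(view_b), dim=-1)
        logits = za @ zb.t() / self.temperature     # (B, B) cosine similarities
        targets = torch.arange(za.size(0))          # matching pairs on the diagonal
        return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss = TwoViewContrastive()(torch.randn(8, 768), torch.randn(8, 128))
    loss.backward()
```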
arXiv Detail & Related papers (2024-06-12T06:06:55Z) - Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings [4.266613351203219]
We study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain.
We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%.
arXiv Detail & Related papers (2024-05-15T06:59:33Z) - Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pre-trained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - AutoVP: An Automated Visual Prompting Framework and Benchmark [66.5618543577204]
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks.
We propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with a benchmark of 12 downstream image-classification tasks.
Our experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin.
arXiv Detail & Related papers (2023-10-12T14:55:31Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z) - Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings [16.829474982595837]
We propose a transfer learning method for speech emotion recognition.
We combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model.
We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
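The trainable layer combination described in this entry can be sketched as a softmax-weighted sum over all wav2vec 2.0 hidden states, learned jointly with a small emotion classifier; the checkpoint, pooling, and classifier head are illustrative assumptions rather than the authors' exact configuration.

```python
# Rough sketch: trainable softmax weights combine every wav2vec 2.0 layer,
# and the pooled result feeds a small downstream emotion classifier.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class WeightedLayerSER(nn.Module):
    def __init__(self, num_emotions: int = 4,
                 checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        num_layers = self.encoder.config.num_hidden_layers + 1  # + initial embedding output
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_values, output_hidden_states=True)
        hidden = torch.stack(out.hidden_states, dim=0)              # (L, B, T, H)
        weights = torch.softmax(self.layer_weights, dim=0)          # learned per layer
        combined = (weights.view(-1, 1, 1, 1) * hidden).sum(dim=0)  # (B, T, H)
        return self.classifier(combined.mean(dim=1))                # mean pool over time


if __name__ == "__main__":
    print(WeightedLayerSER()(torch.randn(2, 16000)).shape)  # torch.Size([2, 4])
```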
arXiv Detail & Related papers (2021-04-08T04:31:58Z) - On Scaling Contrastive Representations for Low-Resource Speech
Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.