Active Learning Based Fine-Tuning Framework for Speech Emotion
Recognition
- URL: http://arxiv.org/abs/2310.00283v1
- Date: Sat, 30 Sep 2023 07:23:29 GMT
- Title: Active Learning Based Fine-Tuning Framework for Speech Emotion
Recognition
- Authors: Dongyuan Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura
- Abstract summary: Speech emotion recognition (SER) has drawn increasing attention for its applications in human-machine interaction.
Existing SER methods ignore the information gap between the pre-training speech recognition task and the downstream SER task, leading to sub-optimal performance.
We propose an active learning (AL) based Fine-Tuning framework for SER that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency.
- Score: 20.28850074164053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech emotion recognition (SER) has drawn increasing attention for its
applications in human-machine interaction. However, existing SER methods ignore
the information gap between the pre-training speech recognition task and the
downstream SER task, leading to sub-optimal performance. Moreover, they require
much time to fine-tune on each specific speech dataset, restricting their
effectiveness in real-world scenes with large-scale noisy data. To address
these issues, we propose an active learning (AL) based Fine-Tuning framework
for SER that leverages task adaptation pre-training (TAPT) and AL methods to
enhance performance and efficiency. Specifically, we first use TAPT to minimize
the information gap between the pre-training and the downstream task. Then, AL
methods are used to iteratively select a subset of the most informative and
diverse samples for fine-tuning, reducing time consumption. Experiments
demonstrate that fine-tuning on only 20% of the samples improves accuracy by
8.45 percentage points and reduces time consumption by 79%.
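The iterative selection step described in the abstract can be sketched with an entropy-based informativeness criterion (a common AL heuristic; the paper combines informativeness with diversity, and the function names below are illustrative, not the authors' API):

```python
import numpy as np

def entropy(probs):
    # Predictive entropy per sample; higher entropy = more uncertain,
    # hence more informative for the next fine-tuning round.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_batch(probs, k):
    # Return indices of the k most uncertain samples in the unlabeled pool.
    return np.argsort(-entropy(probs))[:k]

# Toy pool of 5 utterances with 3-class softmax outputs from the current model.
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform: most informative
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
    [0.90, 0.05, 0.05],
])
print(select_batch(pool_probs, 2))  # indices of the two most uncertain samples
```

In an AL loop, the selected subset is labeled and used for fine-tuning, the model's pool probabilities are recomputed, and selection repeats until the annotation budget (e.g. 20% of the pool) is exhausted.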
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Adaptive Retention & Correction for Continual Learning [114.5656325514408]
A common problem in continual learning is the classification layer's bias towards the most recent task.
We name our approach Adaptive Retention & Correction (ARC)
ARC achieves average performance increases of 2.7% and 2.6% on the CIFAR-100 and ImageNet-R datasets, respectively.
arXiv Detail & Related papers (2024-05-23T08:43:09Z) - Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition [17.59356583727259]
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications.
We propose an active learning (AL)-based fine-tuning framework for SER, called After.
Our proposed method improves accuracy by 8.45% and reduces time consumption by 79%.
arXiv Detail & Related papers (2024-05-01T04:05:29Z) - Efficient Cross-Task Prompt Tuning for Few-Shot Conversational Emotion
Recognition [6.988000604392974]
Emotion Recognition in Conversation (ERC) has been widely studied due to its importance in developing emotion-aware empathetic machines.
We propose a derivative-free optimization method called Cross-Task Prompt Tuning (CTPT) for few-shot conversational emotion recognition.
arXiv Detail & Related papers (2023-10-23T06:46:03Z) - Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z) - Instance-wise Prompt Tuning for Pretrained Language Models [72.74916121511662]
Instance-wise Prompt Tuning (IPT) is the first prompt learning paradigm that injects knowledge from the input data instances to the prompts.
IPT significantly outperforms task-based prompt learning methods, and achieves performance comparable to conventional fine-tuning with only 0.5%-1.5% of the tuned parameters.
arXiv Detail & Related papers (2022-06-04T10:08:50Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Weighted Training for Cross-Task Learning [71.94908559469475]
We introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning.
We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees.
As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning.
arXiv Detail & Related papers (2021-05-28T20:27:02Z) - Recognizing More Emotions with Less Data Using Self-supervised Transfer
Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as low as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z) - A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
To reduce the training burden, we apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.