Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning
- URL: http://arxiv.org/abs/2011.05585v1
- Date: Wed, 11 Nov 2020 06:18:31 GMT
- Title: Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning
- Authors: Jonathan Boigne, Biman Liyanage, Ted Östrem
- Abstract summary: We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel transfer learning method for speech emotion recognition
allowing us to obtain promising results when only limited training data are
available. With as few as 125 examples per emotion class, we were able to reach
a higher accuracy than a strong baseline trained on 8 times more data. Our
method leverages knowledge contained in pre-trained speech representations
extracted from models trained on a more general self-supervised task which
doesn't require human annotations, such as the wav2vec model. We provide
detailed insights on the benefits of our approach by varying the training data
size, which can help labeling teams work more efficiently. We compare
performance with other popular methods on the IEMOCAP dataset, a
well-benchmarked dataset among the Speech Emotion Recognition (SER) research
community. Furthermore, we demonstrate that results can be greatly improved by
combining acoustic and linguistic knowledge from transfer learning. We align
acoustic pre-trained representations with semantic representations from the
BERT model through an attention-based recurrent neural network. Performance
improves significantly when combining both modalities and scales with the
amount of data. When trained on the full IEMOCAP dataset, we reach a new
state-of-the-art of 73.9% unweighted accuracy (UA).
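The fusion step described in the abstract, attending over acoustic frames with a sentence-level text representation, can be illustrated with a minimal NumPy sketch. All shapes, the dot-product attention form, and the linear classifier here are illustrative assumptions, not the paper's exact architecture (which uses an attention-based recurrent network over wav2vec and BERT features):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(acoustic, text):
    """Attend over acoustic frames using the text vector as the query,
    then concatenate the attended acoustic context with the text vector.

    acoustic: (T, d) frame-level features (e.g. wav2vec-like)
    text:     (d,)   sentence-level features (e.g. BERT-like [CLS] vector)
    """
    scores = acoustic @ text / np.sqrt(acoustic.shape[1])  # (T,)
    weights = softmax(scores)                              # attention weights
    context = weights @ acoustic                           # (d,) weighted sum
    return np.concatenate([context, text])                 # (2d,) fused vector

# Toy example: 50 acoustic frames of dim 8, one sentence vector of dim 8.
acoustic = rng.normal(size=(50, 8))
text = rng.normal(size=8)
fused = fuse(acoustic, text)

# Hypothetical linear classifier over 4 emotion classes (IEMOCAP-style setup).
W = rng.normal(size=(16, 4))
probs = softmax(fused @ W)
pred = int(np.argmax(probs))
```

In a real model the attention query, keys, and classifier would be learned jointly with the recurrent network, but the sketch shows how a single text vector can select which acoustic frames contribute to the fused representation.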
Related papers
- A Cross-Lingual Meta-Learning Method Based on Domain Adaptation for Speech Emotion Recognition [1.8377902806196766]
Best-performing speech models are trained on large amounts of data in the language they are meant to work for.
Most languages have sparse data, making training models challenging.
Our work explores the model's performance in limited data, specifically for speech emotion recognition.
arXiv Detail & Related papers (2024-10-06T21:33:51Z)
- Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition [0.0]
Speech emotion recognition (SER) classifies human emotions in speech with a computer model.
We propose a 6-layer convolutional neural network (CNN) model with efficient channel attention (ECA) to pursue an efficient model structure.
On the interactive emotional dyadic motion capture (IEMOCAP) dataset, increasing the frequency resolution when preprocessing emotional speech can improve emotion recognition performance.
arXiv Detail & Related papers (2024-09-06T03:17:25Z) - Self-Supervised Learning for Audio-Based Emotion Recognition [1.7598252755538808]
Self-supervised learning is a family of methods which can learn despite a scarcity of supervised labels.
We have applied self-supervised learning pre-training to the classification of emotions from the CMU- MOSEI's acoustic modality.
We find that self-supervised learning consistently improves the performance of the model across all metrics.
arXiv Detail & Related papers (2023-07-23T14:40:50Z) - BERT WEAVER: Using WEight AVERaging to enable lifelong learning for
transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple, yet efficient post-processing method that infuses old knowledge into the new model.
We show that applying WEAVER in a sequential manner results in similar word embedding distributions as doing a combined training on all data at once.
arXiv Detail & Related papers (2022-02-21T10:34:41Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z) - ALICE: Active Learning with Contrastive Natural Language Explanations [69.03658685761538]
We propose Active Learning with Contrastive Explanations (ALICE) to improve data efficiency in learning.
ALICE learns to first use active learning to select the most informative pairs of label classes to elicit contrastive natural language explanations.
It then extracts knowledge from these explanations to guide model learning.
arXiv Detail & Related papers (2020-09-22T01:02:07Z) - A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy compared to the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on this enlarged dataset, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.