LanSER: Language-Model Supported Speech Emotion Recognition
- URL: http://arxiv.org/abs/2309.03978v1
- Date: Thu, 7 Sep 2023 19:21:08 GMT
- Title: LanSER: Language-Model Supported Speech Emotion Recognition
- Authors: Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian
Eoff, Brendan Jou
- Abstract summary: We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models.
For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects an emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition.
Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency.
- Score: 25.597250907836152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition (SER) models typically rely on costly
human-labeled data for training, making scaling methods to large speech
datasets and nuanced emotion taxonomies difficult. We present LanSER, a method
that enables the use of unlabeled data by inferring weak emotion labels via
pre-trained large language models through weakly-supervised learning. For
inferring weak labels constrained to a taxonomy, we use a textual entailment
approach that selects an emotion label with the highest entailment score for a
speech transcript extracted via automatic speech recognition. Our experimental
results show that models pre-trained on large datasets with this weak
supervision outperform other baseline models on standard SER datasets when
fine-tuned, and show improved label efficiency. Despite being pre-trained on
labels derived only from text, we show that the resulting representations
appear to model the prosodic content of speech.
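The weak-labeling step described above can be illustrated with a short, hedged sketch: a pre-trained NLI model scores each emotion in a fixed taxonomy as an entailment hypothesis against the ASR transcript, and the highest-scoring emotion becomes the weak label. The snippet below is a minimal illustration, not the authors' pipeline; the Hugging Face zero-shot-classification pipeline, the facebook/bart-large-mnli model, the hypothesis template, and the six-emotion taxonomy are all placeholder assumptions.

```python
# Minimal sketch of entailment-based weak labeling, assuming an off-the-shelf
# NLI model via the Hugging Face zero-shot-classification pipeline. The
# taxonomy and hypothesis template below are illustrative, not from the paper.
from transformers import pipeline

# Any NLI-trained model can serve as the entailment scorer.
entailment = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Emotion taxonomy to which weak labels are constrained (placeholder set).
taxonomy = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

def weak_label(transcript: str) -> str:
    """Return the emotion whose entailment score is highest for the transcript."""
    result = entailment(
        transcript,
        candidate_labels=taxonomy,
        hypothesis_template="The speaker feels {}.",
    )
    # The pipeline returns labels sorted by descending entailment score.
    return result["labels"][0]

# In LanSER the transcript would come from automatic speech recognition;
# here a plain string stands in for the ASR output.
print(weak_label("I can't believe we finally won the championship!"))
```

In this sketch the hypothesis template turns each taxonomy entry into a natural-language hypothesis, which is what lets an entailment model produce a score per emotion; the resulting weak labels would then supervise pre-training of the speech model.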
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Vision-language Assisted Attribute Learning [53.60196963381315]
Attribute labeling at large scale is typically incomplete and partial.
Existing attribute learning methods often treat the missing labels as negative or simply ignore them all during training.
We leverage the available vision-language knowledge to explicitly disclose the missing labels for enhancing model learning.
arXiv Detail & Related papers (2023-12-12T06:45:19Z)
- Context Unlocks Emotions: Text-based Emotion Classification Dataset Auditing with Large Language Models [23.670143829183104]
The lack of contextual information in text data can make the annotation process of text-based emotion classification datasets challenging.
We propose a formal definition of textual context to motivate a prompting strategy to enhance such contextual information.
Our method improves alignment between inputs and their human-annotated labels from both an empirical and human-evaluated standpoint.
arXiv Detail & Related papers (2023-11-06T21:34:49Z)
- Leveraging Label Information for Multimodal Emotion Recognition [22.318092635089464]
Multimodal emotion recognition (MER) aims to detect the emotional status of a given expression by combining the speech and text information.
We propose a novel approach for MER by leveraging label information.
We devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations for emotion classification.
arXiv Detail & Related papers (2023-09-05T10:26:32Z)
- Self-Supervised Learning for Audio-Based Emotion Recognition [1.7598252755538808]
Self-supervised learning is a family of methods which can learn despite a scarcity of supervised labels.
We apply self-supervised pre-training to the classification of emotions from CMU-MOSEI's acoustic modality.
We find that self-supervised learning consistently improves the performance of the model across all metrics.
arXiv Detail & Related papers (2023-07-23T14:40:50Z)
- Detecting Label Errors using Pre-Trained Language Models [37.82128817976385]
We show that large pre-trained language models are extremely capable of identifying label errors in datasets.
We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP.
arXiv Detail & Related papers (2022-05-25T11:59:39Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning enables adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)