Related papers: Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

URL: http://arxiv.org/abs/2309.04849v2
Date: Thu, 14 Mar 2024 21:46:37 GMT
Title: Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations
Authors: Debaditya Shome, Ali Etemad
Abstract summary: EmoDistill is a novel framework to learn strong linguistic and prosodic representations of emotion from speech. Our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin.
Score: 23.4909421082857
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computation overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
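The abstract describes distillation at both the logit and embedding levels from two teachers, but it does not give the loss formulation. The following is only a minimal sketch of how such a combined objective is commonly built, assuming a temperature-softened KL term for logits and a mean-squared-error term for embeddings. The function names, the weights `alpha` and `beta`, the temperature, and the teacher dictionary layout are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2.

    Assumption: a generic Hinton-style distillation term, not the paper's exact loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def embedding_distillation_loss(student_emb, teacher_emb):
    """Mean squared error between a student embedding and a teacher embedding.

    Assumption: MSE is a stand-in; the paper may use a different distance.
    """
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)


def cross_entropy(logits, label):
    """Standard cross-entropy of the student prediction against the true label."""
    return -math.log(softmax(logits)[label])


def total_loss(student_logits, label, student_embs, teachers, alpha=1.0, beta=1.0):
    """Supervised loss plus logit- and embedding-level terms for each teacher.

    `teachers` maps a hypothetical teacher name (e.g. "prosodic", "linguistic")
    to a dict with "logits" and "embedding"; `student_embs` maps the same names
    to the student embedding matched against that teacher.
    """
    loss = cross_entropy(student_logits, label)
    for name, teacher in teachers.items():
        loss += alpha * logit_distillation_loss(student_logits, teacher["logits"])
        loss += beta * embedding_distillation_loss(student_embs[name], teacher["embedding"])
    return loss


# Usage with made-up numbers for a 4-class emotion problem.
teachers = {
    "prosodic": {"logits": [2.0, 0.5, 0.1, -1.0], "embedding": [0.2, 0.4, 0.1]},
    "linguistic": {"logits": [1.5, 1.0, -0.5, -0.5], "embedding": [0.0, 0.3, 0.3]},
}
student_embs = {"prosodic": [0.1, 0.4, 0.2], "linguistic": [0.1, 0.2, 0.3]}
print(total_loss([1.8, 0.7, 0.0, -0.8], 0, student_embs, teachers))
```

In a setup like this the teachers are needed only to compute the training loss, which is consistent with the abstract's statement that inference uses the speech stream alone.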

Related papers

Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025 [64.59170359368699]
We present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues.
arXiv Detail & Related papers (2025-06-02T13:46:02Z)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection. VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z)
DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units. We show that this embedding can be used to predict the pitch and duration of speech units in a target language. We evaluate our approach on English and French speech signals and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning [29.916793641951507]
This paper presents a new benchmark for Aphasia speech recognition using state-of-the-art speech recognition techniques. We introduce two multi-task learning methods based on the CTC/Attention architecture to perform both tasks simultaneously. Our system achieves state-of-the-art speaker-level detection accuracy (97.3%), and a relative WER reduction of 11% for moderate Aphasia patients.
arXiv Detail & Related papers (2023-05-19T15:10:36Z)
Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS. We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition. Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction. Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text. tts4pretrain complements the power of contrastive learning in self-supervision. We demonstrate Word Error Rate (WER) reductions of 10% relative on the well-benchmarked LibriSpeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues through an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes. Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition. With as low as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks. Traditionally, these tasks have been tackled using signal processing and machine learning techniques. Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture. We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
