Self-Supervised Attention Networks and Uncertainty Loss Weighting for
Multi-Task Emotion Recognition on Vocal Bursts
- URL: http://arxiv.org/abs/2209.07384v1
- Date: Thu, 15 Sep 2022 15:50:27 GMT
- Title: Self-Supervised Attention Networks and Uncertainty Loss Weighting for
Multi-Task Emotion Recognition on Vocal Bursts
- Authors: Vincent Karas, Andreas Triantafyllopoulos, Meishu Song and Björn W. Schuller
- Abstract summary: We present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB).
Our approach surpasses the challenge baseline by a wide margin on all four tasks.
- Score: 5.3802825558183835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vocal bursts play an important role in communicating affect, making them
valuable for improving speech emotion recognition. Here, we present our
approach for classifying vocal bursts and predicting their emotional
significance in the ACII Affective Vocal Burst Workshop & Challenge 2022
(A-VB). We use a large self-supervised audio model as a shared feature extractor
and compare multiple architectures built on classifier chains and attention
networks, combined with uncertainty loss weighting strategies. Our approach
surpasses the challenge baseline by a wide margin on all four tasks.
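The abstract names uncertainty loss weighting but does not spell out the formulation. A minimal sketch, assuming the common homoscedastic-uncertainty scheme of Kendall et al. (2018) with one learnable log-variance per task; all names here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Homoscedastic uncertainty weighting (Kendall et al., 2018),
    simplified: total = sum_i exp(-s_i) * L_i + s_i, where s_i is a
    learnable log-variance for task i."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            # Down-weight noisy tasks, penalize large variances.
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```

For the four A-VB tasks (High, Two, Culture, Type), the module would be constructed once so its log-variances are trained, e.g. `weighting = UncertaintyWeighting(4)` and `total = weighting([l_high, l_two, l_culture, l_type])`.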
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embeddings of multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task, which is a more realistic setting.
We leverage the rich ComParE dataset of multilingual speakers, with a multi-label regression target of 'emotion share', i.e., the perceived share of each emotion.
Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides a 4.6% improvement over the reported baseline.
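The summary does not specify the light-weight sequence model; a minimal sketch, assuming a self-attentive pooling head over frozen frame-level HuBERT-Large embeddings (the 1024-dim input matches HuBERT-Large; the 128-unit hidden size is an illustrative choice):

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Attention-weighted pooling over frame embeddings, followed by
    a linear regression head (e.g. for 'emotion share' targets)."""
    def __init__(self, dim: int = 1024, num_targets: int = 1):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.head = nn.Linear(dim, num_targets)

    def forward(self, frames):                    # (batch, time, dim)
        weights = torch.softmax(self.score(frames), dim=1)
        pooled = (weights * frames).sum(dim=1)    # (batch, dim)
        return self.head(pooled)
```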
arXiv Detail & Related papers (2023-08-28T07:11:27Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
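The exact aggregation modules are not given in the summary; a minimal sketch of the layer-wise part, assuming a learned convex combination over the hidden states of all SSL transformer layers, followed by temporal mean pooling:

```python
import torch
import torch.nn as nn

class LayerwiseAggregation(nn.Module):
    """Learned softmax-weighted sum over all SSL layer outputs,
    then mean pooling over time."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):   # (layers, batch, time, dim)
        w = torch.softmax(self.layer_logits, dim=0)
        mixed = (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
        return mixed.mean(dim=1)        # (batch, dim)
```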
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- An Efficient Multitask Learning Architecture for Affective Vocal Burst Analysis [1.2951621755732543]
Current approaches to affective vocal burst analysis are mostly based on wav2vec2 or HuBERT features.
In this paper, we investigate the use of the wav2vec successor data2vec in combination with a multitask learning pipeline to tackle different analysis problems at once.
To assess the performance of our efficient multitask learning architecture, we participate in the 2022 ACII Affective Vocal Burst Challenge.
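To illustrate how one shared embedding can serve several analysis problems at once, a minimal sketch of a multitask head on top of a data2vec-style encoder output. The 768-dim input is an assumption, and the head sizes follow the four A-VB 2022 tasks as commonly described (10 emotion intensities, valence/arousal, 10 emotions across 4 cultures, 8 burst types), not details from this paper:

```python
import torch
import torch.nn as nn

class AVBMultiTaskHeads(nn.Module):
    """One shared utterance embedding feeding four task heads."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.heads = nn.ModuleDict({
            "high":    nn.Linear(dim, 10),  # 10 emotion intensities
            "two":     nn.Linear(dim, 2),   # valence and arousal
            "culture": nn.Linear(dim, 40),  # 10 emotions x 4 cultures
            "type":    nn.Linear(dim, 8),   # 8 vocal burst classes
        })

    def forward(self, embedding):           # (batch, dim)
        return {name: head(embedding) for name, head in self.heads.items()}
```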
arXiv Detail & Related papers (2022-09-28T08:32:08Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free speech emotion recognition (SER) and better performance on emotionless speaker verification (SV).
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Speaker Attentive Speech Emotion Recognition [11.92436948211501]
The Speech Emotion Recognition (SER) task has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs).
We present novel work based on the idea of teaching the emotion recognition network about speaker identity.
arXiv Detail & Related papers (2021-04-15T07:59:37Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
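PASE+'s actual distortion module applies a richer set of contaminations; a minimal stand-in, assuming a 16 kHz waveform, that randomly adds Gaussian noise and a crude single-echo "reverb":

```python
import torch

def distort(wave: torch.Tensor, noise_std: float = 0.05,
            p: float = 0.5, delay: int = 800) -> torch.Tensor:
    """Randomly contaminate a waveform of shape (..., samples).
    A delay of 800 samples is ~50 ms at 16 kHz. Illustrative only,
    not the actual PASE+ distortion module."""
    if torch.rand(()) < p:                              # additive noise
        wave = wave + noise_std * torch.randn_like(wave)
    if torch.rand(()) < p and wave.shape[-1] > delay:   # crude echo
        echo = torch.zeros_like(wave)
        echo[..., delay:] = 0.3 * wave[..., :-delay]
        wave = wave + echo
    return wave
```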