An Efficient Multitask Learning Architecture for Affective Vocal Burst
Analysis
- URL: http://arxiv.org/abs/2209.13914v1
- Date: Wed, 28 Sep 2022 08:32:08 GMT
- Title: An Efficient Multitask Learning Architecture for Affective Vocal Burst
Analysis
- Authors: Tobias Hallmen, Silvan Mertes, Dominik Schiller, Elisabeth André
- Abstract summary: Current approaches to address affective vocal burst analysis are mostly based on wav2vec2 or HuBERT features.
In this paper, we investigate the use of the wav2vec successor data2vec in combination with a multitask learning pipeline to tackle different analysis problems at once.
To assess the performance of our efficient multitask learning architecture, we participate in the 2022 ACII Affective Vocal Burst Challenge.
- Score: 1.2951621755732543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Affective speech analysis is an ongoing topic of research. A relatively new
problem in this field is the analysis of vocal bursts, which are nonverbal
vocalisations such as laughs or sighs. Current state-of-the-art approaches to
address affective vocal burst analysis are mostly based on wav2vec2 or HuBERT
features. In this paper, we investigate the use of the wav2vec successor
data2vec in combination with a multitask learning pipeline to tackle different
analysis problems at once. To assess the performance of our efficient multitask
learning architecture, we participate in the 2022 ACII Affective Vocal Burst
Challenge, showing that our approach substantially outperforms the baseline
established there in three different subtasks.
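The abstract does not spell out the pipeline in detail, but the general pattern of attaching several task-specific heads to one shared self-supervised feature extractor can be sketched as follows. All dimensions, head names, and output sizes below are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of a multitask head on top of pooled data2vec-style
# features. Dimensions, task names, and output ranges are assumptions made
# for illustration only.
import torch
import torch.nn as nn

class MultitaskVocalBurstModel(nn.Module):
    def __init__(self, feat_dim=768, n_emotions=10, n_types=8):
        super().__init__()
        # Shared projection over pooled self-supervised features.
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        # One head per subtask: emotion-intensity regression,
        # valence/arousal regression, and burst-type classification.
        self.head_high = nn.Linear(256, n_emotions)  # intensities, assumed in [0, 1]
        self.head_two = nn.Linear(256, 2)            # valence, arousal, assumed in [0, 1]
        self.head_type = nn.Linear(256, n_types)     # type logits

    def forward(self, pooled_feats):
        h = self.shared(pooled_feats)
        return {
            "high": torch.sigmoid(self.head_high(h)),
            "two": torch.sigmoid(self.head_two(h)),
            "type": self.head_type(h),
        }

# Random tensors stand in for mean-pooled data2vec features and labels.
feats = torch.randn(4, 768)
model = MultitaskVocalBurstModel()
out = model(feats)
loss = (nn.functional.mse_loss(out["high"], torch.rand(4, 10))
        + nn.functional.mse_loss(out["two"], torch.rand(4, 2))
        + nn.functional.cross_entropy(out["type"], torch.randint(0, 8, (4,))))
loss.backward()
```

In this kind of setup the per-task losses are simply summed, so a single shared backbone is updated by all subtasks at once.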
Related papers
- PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing a multimodal conversational Aspect-based Sentiment Analysis (ABSA) setting.
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z)
- Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction [6.1058750788332325]
We introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild.
Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset.
We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector.
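The feature-fusion step described here can be illustrated with a minimal sketch: frame-level features are concatenated with their utterance-level mean so every frame carries global context. The shapes and the concatenation strategy are assumptions, not the paper's exact design.

```python
# Illustrative fusion of frame-level Wav2Vec 2.0-style features with a
# global mean vector; random tensors stand in for extracted features.
import torch

frames = torch.randn(1, 250, 768)                 # (batch, time, feature) placeholder
global_mean = frames.mean(dim=1, keepdim=True)    # (batch, 1, feature) utterance-level context
fused = torch.cat([frames, global_mean.expand_as(frames)], dim=-1)
# fused: (batch, time, 2 * feature), each frame paired with the global mean
```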
arXiv Detail & Related papers (2024-03-18T15:32:02Z)
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive Speech Emotion Recognition accuracies.
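As a hedged illustration of a fixed-length embedding derived from continuous representations, mean pooling over the frames of a word segment is the simplest construction; the paper's actual AWE extraction (e.g., layer-wise selection or learned pooling) may differ.

```python
# Simplified acoustic word embedding via mean pooling; segment lengths and
# feature dimensions are placeholders, not values from the paper.
import torch

def acoustic_word_embedding(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (n_frames, feat_dim) continuous features for one word segment."""
    return frame_feats.mean(dim=0)  # (feat_dim,) fixed-length embedding

segments = [torch.randn(n, 768) for n in (23, 57, 41)]  # variable-length segments
awes = torch.stack([acoustic_word_embedding(s) for s in segments])  # (3, 768)
```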
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
- A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition [0.0]
Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings
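The evaluation setup, training standard learning algorithms on pre-extracted embeddings, can be sketched generically as below; the embedding source, dataset, and classifier here are placeholders rather than the study's actual choices.

```python
# Generic downstream evaluation on pre-extracted embeddings; random arrays
# stand in for PTM embeddings and emotion labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(500, 768)           # placeholder embeddings from a pre-trained model
y = np.random.randint(0, 6, size=500)   # placeholder emotion labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```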
arXiv Detail & Related papers (2023-04-22T19:56:35Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
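A minimal sketch of the chain-regression idea, in which each later predictor is conditioned on the outputs of earlier ones, is given below; the ordering, dimensions, and the absence of the paper's layer-wise and temporal aggregation modules are simplifications for illustration.

```python
# Illustrative chain regression: each stage also receives the previous
# stage's predictions as input. Output sizes are invented placeholders.
import torch
import torch.nn as nn

class ChainRegressor(nn.Module):
    def __init__(self, feat_dim=768, out_dims=(2, 10)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_dim = feat_dim
        for d in out_dims:
            self.stages.append(nn.Linear(in_dim, d))
            in_dim += d  # the next stage also sees this stage's prediction

    def forward(self, x):
        preds = []
        for stage in self.stages:
            y = stage(x)
            preds.append(y)
            x = torch.cat([x, y], dim=-1)
        return preds

outs = ChainRegressor()(torch.randn(4, 768))  # [(4, 2), (4, 10)]
```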
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts [5.3802825558183835]
We present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB).
Our approach surpasses the challenge baseline by a wide margin on all four tasks.
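The uncertainty loss weighting named in the title is commonly implemented with learnable per-task log-variances (homoscedastic uncertainty weighting); the sketch below follows that common formulation and may not match the authors' exact variant.

```python
# Uncertainty-based multi-task loss weighting with learnable log-variances.
# Task count and losses are placeholders.
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    def __init__(self, n_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # one log-variance per task

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            # Down-weight tasks with high estimated uncertainty, plus a regularizer.
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

weighting = UncertaintyWeighting(n_tasks=3)
task_losses = [torch.rand(1, requires_grad=True).sum() for _ in range(3)]
total_loss = weighting(task_losses)
total_loss.backward()
```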
arXiv Detail & Related papers (2022-09-15T15:50:27Z)
- Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts [49.31604138034298]
Burst2Vec uses pre-trained speech representations to capture acoustic information from raw waveforms.
Our models achieve a relative 30 % performance gain over baselines using pre-extracted features.
arXiv Detail & Related papers (2022-06-24T18:57:41Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
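The extract-and-fuse step can be caricatured as cross-attention in which source-utterance phonetic features query target-speaker frames; the single attention layer and dimensions below are simplifications, not the full FragmentVC architecture.

```python
# Simplified cross-attention fusion: source phonetic features attend over
# target-speaker frames to retrieve fine-grained fragments. Feature shapes
# are placeholders.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
source_phonetic = torch.randn(1, 200, 768)  # placeholder Wav2Vec 2.0 features (source utterance)
target_frames = torch.randn(1, 350, 768)    # placeholder target-speaker features
fused, weights = attn(query=source_phonetic, key=target_frames, value=target_frames)
# "fused" mixes target-speaker fragments into the source phonetic structure.
```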
arXiv Detail & Related papers (2020-10-27T09:21:03Z)