The ICML 2022 Expressive Vocalizations Workshop and Competition:
Recognizing, Generating, and Personalizing Vocal Bursts
- URL: http://arxiv.org/abs/2205.01780v1
- Date: Tue, 3 May 2022 21:06:44 GMT
- Title: The ICML 2022 Expressive Vocalizations Workshop and Competition:
Recognizing, Generating, and Personalizing Vocal Bursts
- Authors: Alice Baird, Panagiotis Tzirakis, Gauthier Gidel, Marco Jiralerspong,
Eilif B. Muller, Kory Mathewson, Björn Schuller, Erik Cambria, Dacher
Keltner, Alan Cowen
- Abstract summary: ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies.
- Score: 28.585851793516873
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The ICML Expressive Vocalization (ExVo) Competition is focused on
understanding and generating vocal bursts: laughs, gasps, cries, and other
non-verbal vocalizations that are central to emotional expression and
communication. ExVo 2022 includes three competition tracks using a large-scale
dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask,
requires participants to train a multi-task model to recognize expressed
emotions and demographic traits from vocal bursts. The second, ExVo-Generate,
requires participants to train a generative model that produces vocal bursts
conveying ten different emotions. The third, ExVo-FewShot, requires
participants to leverage few-shot learning incorporating speaker identity to
train a model for the recognition of 10 emotions conveyed by vocal bursts. This
paper describes the three tracks and provides performance measures for baseline
models using state-of-the-art machine learning strategies. The baseline for
each track is as follows: for ExVo-MultiTask, a combined score ($S_{MTL}$),
computed as the harmonic mean of the Concordance Correlation Coefficient (CCC),
Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE),
reaches at best 0.335; for ExVo-Generate, we report Fréchet inception distance
(FID) scores between the training set and generated samples ranging from 4.81
to 8.27, depending on the emotion, and combining the inverted FID with
perceptual ratings of the generated samples yields a score ($S_{Gen}$) of
0.174; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
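As a rough illustration of the multi-task baseline metric, the sketch below combines CCC, UAR, and an inverted MAE into a harmonic mean. The abstract does not specify how MAE is inverted or normalized, so the `1 / (1 + mae)` mapping here is an assumption for illustration only, not the official challenge definition.

```python
import numpy as np

def harmonic_mean(scores):
    """Harmonic mean of strictly positive scores."""
    scores = np.asarray(scores, dtype=float)
    return len(scores) / np.sum(1.0 / scores)

def s_mtl(ccc, uar, mae):
    """Illustrative combined multi-task score in the spirit of S_MTL.

    ccc: Concordance Correlation Coefficient (higher is better, 0..1)
    uar: Unweighted Average Recall (higher is better, 0..1)
    mae: Mean Absolute Error (lower is better)

    NOTE: mapping MAE to an "inverted" 0..1 score via 1 / (1 + mae) is an
    assumption; the official ExVo-MultiTask formula may differ.
    """
    inverted_mae = 1.0 / (1.0 + mae)
    return harmonic_mean([ccc, uar, inverted_mae])

# Placeholder values, not the reported baseline components
print(round(s_mtl(ccc=0.42, uar=0.41, mae=2.0), 3))
```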
Related papers
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z) - CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [97.75037148056367]
CoVoMix is a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
We devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation.
arXiv Detail & Related papers (2024-04-10T02:32:58Z) - MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models [13.137392771279742]
This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations.
We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction framework that integrates text, audio, and visual modalities.
arXiv Detail & Related papers (2024-03-31T01:16:02Z) - A Hierarchical Regression Chain Framework for Affective Vocal Burst
Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z) - EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine
Learning Classification Methodologies [0.0]
The EmoGator dataset consists of 32,130 samples from 357 speakers, totaling 16.9654 hours of audio.
Each sample was classified into one of 30 distinct emotion categories by the speaker.
arXiv Detail & Related papers (2023-01-02T03:02:10Z) - Proceedings of the ICML 2022 Expressive Vocalizations Workshop and
Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 included three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts.
The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions.
arXiv Detail & Related papers (2022-07-14T14:30:34Z) - The ACII 2022 Affective Vocal Bursts Workshop & Competition:
Understanding a critically understudied modality of emotional expression [16.364737403587235]
This year's competition comprises four tracks using a dataset of 59,299 vocalizations from 1,702 speakers.
This paper describes the four tracks and baseline systems, which use state-of-the-art machine learning methods.
The baseline performance for each track is obtained by utilizing an end-to-end deep learning model.
arXiv Detail & Related papers (2022-07-07T21:09:35Z) - Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms [14.046451550358427]
We describe an approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition.
We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples.
The mel-spectrograms generated by the model are then inverted back to the audio domain (a sketch of one such inversion appears after this list).
arXiv Detail & Related papers (2022-06-25T05:39:52Z) - Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion,
Age, and Origin from Vocal Bursts [49.31604138034298]
Burst2Vec uses pre-trained speech representations to capture acoustic information from raw waveforms.
Our models achieve a relative 30% performance gain over baselines using pre-extracted features.
arXiv Detail & Related papers (2022-06-24T18:57:41Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model, trained for automatic speech recognition, together with features extracted from the melody to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
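For the mel-spectrogram inversion step referenced in the StyleGAN2 entry above, a minimal sketch using librosa's Griffin-Lim-based mel inversion is given below. The sample rate, FFT, hop, and mel parameters are placeholders, and the competition entry may well use different preprocessing or a neural vocoder instead.

```python
import librosa
import soundfile as sf

# Placeholder signal-processing parameters (not taken from the paper)
SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 128

def audio_to_mel(wav_path):
    """Compute a mel-spectrogram of a vocal burst (assumed preprocessing)."""
    y, _ = librosa.load(wav_path, sr=SR)
    return librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)

def mel_to_wav(mel, out_path):
    """Invert a (generated) mel-spectrogram back to audio via Griffin-Lim."""
    y = librosa.feature.inverse.mel_to_audio(
        mel, sr=SR, n_fft=N_FFT, hop_length=HOP)
    sf.write(out_path, y, SR)
```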