Proceedings of the ICML 2022 Expressive Vocalizations Workshop and
Competition: Recognizing, Generating, and Personalizing Vocal Bursts
- URL: http://arxiv.org/abs/2207.06958v1
- Date: Thu, 14 Jul 2022 14:30:34 GMT
- Title: Proceedings of the ICML 2022 Expressive Vocalizations Workshop and
Competition: Recognizing, Generating, and Personalizing Vocal Bursts
- Authors: Alice Baird, Panagiotis Tzirakis, Gauthier Gidel, Marco Jiralerspong,
Eilif B. Muller, Kory Mathewson, Björn Schuller, Erik Cambria, Dacher
Keltner, Alan Cowen
- Abstract summary: ExVo 2022 included three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts.
The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions.
- Score: 28.585851793516873
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This is the Proceedings of the ICML Expressive Vocalization (ExVo)
Competition. The ExVo competition focuses on understanding and generating vocal
bursts: laughs, gasps, cries, and other non-verbal vocalizations that are
central to emotional expression and communication. ExVo 2022 included three
competition tracks using a large-scale dataset of 59,201 vocalizations from
1,702 speakers. The first, ExVo-MultiTask, requires participants to train a
multi-task model to recognize expressed emotions and demographic traits from
vocal bursts. The second, ExVo-Generate, requires participants to train a
generative model that produces vocal bursts conveying ten different emotions.
The third, ExVo-FewShot, requires participants to leverage few-shot learning
incorporating speaker identity to train a model for the recognition of ten
emotions conveyed by vocal bursts.
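For readers unfamiliar with the ExVo-MultiTask setup, the following is a minimal sketch of one way such a multi-task model could look. It is an illustrative assumption only: the feature dimensionality, hidden size, demographic targets (age regression and four-way country classification), loss choices, and the use of PyTorch are not specified by the proceedings.

```python
import torch
import torch.nn as nn

class MultiTaskVocalBurstModel(nn.Module):
    """Shared trunk with one head per task (emotion, age, country) -- illustrative only."""

    def __init__(self, feat_dim=512, hidden_dim=256, n_emotions=10, n_countries=4):
        super().__init__()
        # Shared encoder over pre-extracted acoustic features (dimensions assumed).
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
        )
        self.emotion_head = nn.Linear(hidden_dim, n_emotions)   # emotion intensities
        self.age_head = nn.Linear(hidden_dim, 1)                 # age regression
        self.country_head = nn.Linear(hidden_dim, n_countries)   # country logits

    def forward(self, x):
        h = self.trunk(x)
        return {
            "emotion": torch.sigmoid(self.emotion_head(h)),      # intensities in [0, 1]
            "age": self.age_head(h).squeeze(-1),
            "country": self.country_head(h),
        }

# Joint objective: unweighted sum of per-task losses (the weighting is a free choice).
model = MultiTaskVocalBurstModel()
x = torch.randn(8, 512)                        # batch of 8 feature vectors (synthetic)
y_emotion = torch.rand(8, 10)                  # gold emotion intensities
y_age = torch.rand(8) * 60 + 18                # gold ages
y_country = torch.randint(0, 4, (8,))          # gold country indices
out = model(x)
loss = (
    nn.functional.mse_loss(out["emotion"], y_emotion)
    + nn.functional.l1_loss(out["age"], y_age)
    + nn.functional.cross_entropy(out["country"], y_country)
)
loss.backward()
```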
Related papers
- Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules (a minimal sketch of this aggregation idea appears after this list).
The proposed systems participated in the ACII Affective Vocal Bursts (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine Learning Classification Methodologies [0.0]
The EmoGator dataset consists of 32,130 samples from 357 speakers, totaling 16.9654 hours of audio.
Each sample was classified by the speaker into one of 30 distinct emotion categories.
arXiv Detail & Related papers (2023-01-02T03:02:10Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion as presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, bridging the visual information to the corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- Fine-tuning Wav2vec for Vocal-burst Emotion Recognition [7.910908058662372]
The ACII Affective Vocal Bursts (A-VB) competition introduces a new topic to affective computing.
Vocal bursts such as laughs, cries, and sighs remain largely unexploited even though they are highly informative for behavior analysis.
This technical report describes the method and the result of SclabCNU Team for the tasks of the challenge.
arXiv Detail & Related papers (2022-10-01T12:03:27Z)
- The ACII 2022 Affective Vocal Bursts Workshop & Competition: Understanding a critically understudied modality of emotional expression [16.364737403587235]
This year's competition comprises four tracks using a dataset of 59,299 vocalizations from 1,702 speakers.
This paper describes the four tracks and baseline systems, which use state-of-the-art machine learning methods.
The baseline performance for each track is obtained with an end-to-end deep learning model.
arXiv Detail & Related papers (2022-07-07T21:09:35Z)
- Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts [49.31604138034298]
Burst2Vec uses pre-trained speech representations to capture acoustic information from raw waveforms.
Our models achieve a relative 30% performance gain over baselines that use pre-extracted features.
arXiv Detail & Related papers (2022-06-24T18:57:41Z)
- The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies.
arXiv Detail & Related papers (2022-05-03T21:06:44Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
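As referenced in the hierarchical regression chain entry above, here is a minimal sketch of layer-wise and temporal aggregation over SSL representations: learned softmax weights over encoder layers followed by temporal mean pooling. The layer count, hidden size, projection layer, and PyTorch implementation are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class LayerwiseTemporalAggregation(nn.Module):
    """Weighted sum over SSL layers, then temporal average pooling -- illustrative only."""

    def __init__(self, n_layers=13, hidden_dim=768):
        super().__init__()
        # One learnable scalar weight per SSL layer, normalized with softmax.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden_states):
        # hidden_states: (n_layers, batch, time, hidden_dim)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * hidden_states).sum(dim=0)    # (batch, time, hidden_dim)
        pooled = fused.mean(dim=1)                # temporal average pooling
        return self.proj(pooled)                  # utterance-level embedding

# Example: 13 layers (CNN output + 12 transformer blocks), batch of 4, 200 frames.
agg = LayerwiseTemporalAggregation()
states = torch.randn(13, 4, 200, 768)
embedding = agg(states)                           # shape: (4, 768)
```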