Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
- URL: http://arxiv.org/abs/2206.12563v1
- Date: Sat, 25 Jun 2022 05:39:52 GMT
- Title: Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
- Authors: Marco Jiralerspong and Gauthier Gidel
- Abstract summary: We describe an approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition.
We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples.
The mel-spectrograms generated by the model are then inverted back to the audio domain.
- Score: 14.046451550358427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe our approach for the generative emotional vocal burst task (ExVo
Generate) of the ICML Expressive Vocalizations Competition. We train a
conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions
of the audio samples. The mel-spectrograms generated by the model are then
inverted back to the audio domain. As a result, our generated samples
substantially improve upon the baseline provided by the competition from a
qualitative and quantitative perspective for all emotions. More precisely, even
for our worst-performing emotion (awe), we obtain an FAD of 1.76 compared to
the baseline of 4.81 (as a reference, the FAD between the train/validation sets
for awe is 0.776).
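The pipeline above trains on mel-spectrograms, i.e. STFT magnitudes projected through a bank of triangular filters spaced evenly on the mel scale. As a rough illustration of that projection only (this is not the authors' preprocessing code; `mel_filterbank` and all parameter defaults are illustrative), a plain-numpy filterbank can be built as:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the (HTK-style) mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=16000, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1).

    Multiplying an STFT magnitude spectrogram by this matrix gives a
    mel-spectrogram. Parameter values here are common defaults, not
    the ones used in the paper.
    """
    fmax = fmax if fmax is not None else sr / 2
    n_bins = n_fft // 2 + 1
    # Equally spaced points on the mel scale, converted back to Hz;
    # consecutive triples (lo, center, hi) define each triangle.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bin_freqs = np.linspace(0, sr / 2, n_bins)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (bin_freqs - lo) / (ctr - lo)
        falling = (hi - bin_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb
```

Inverting generated mel-spectrograms back to audio, as the abstract describes, additionally requires (approximately) undoing this projection and estimating phase, e.g. with a Griffin-Lim-style procedure or a neural vocoder.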
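The FAD figures quoted above are Fréchet Audio Distances: each set of audio clips is mapped to embeddings, a Gaussian is fit to each embedding set, and the Fréchet distance between the two Gaussians is reported. A minimal sketch of that final distance computation, assuming the embeddings are already available as numpy arrays (the `frechet_distance` name is illustrative, and the embedding-extraction step is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussian fits of two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim), e.g. per-clip
    audio embeddings from a pretrained classifier.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # sqrtm can return a complex matrix due to numerical noise;
    # the imaginary part is negligible for PSD inputs.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Lower is better: identical distributions give a distance of zero, which is why the train/validation FAD of 0.776 for awe serves as a practical floor for the reported 1.76.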
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our zero-shot style transfer evaluations show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation [41.98697872087318]
We introduce Diff-HierVC, a hierarchical VC system based on two diffusion models.
Our model achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
arXiv Detail & Related papers (2023-11-08T14:02:53Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation [11.91301106502376]
SpeechBlender is a fine-grained data augmentation pipeline for generating mispronunciation errors.
Our proposed technique achieves state-of-the-art results on Speechocean762 with ASR-dependent mispronunciation detection models.
arXiv Detail & Related papers (2022-11-02T07:13:30Z)
- Proceedings of the ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 included three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts.
The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions.
arXiv Detail & Related papers (2022-07-14T14:30:34Z)
- The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies.
arXiv Detail & Related papers (2022-05-03T21:06:44Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is a singing voice synthesis (SVS) system aimed at high-fidelity singing voices.
It consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder.
Experiment results show that HiFiSinger synthesizes singing voices of substantially higher fidelity than previous systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.