Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement
- URL: http://arxiv.org/abs/2501.13372v1
- Date: Thu, 23 Jan 2025 04:27:37 GMT
- Authors: Jae-Sung Bae, Anastasia Kuznetsova, Dinesh Manocha, John Hershey, Trausti Kristjansson, Minje Kim
- Abstract summary: This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task of personalized speech enhancement (PSE). The aim is to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance.
- Abstract: This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are tasked first with building zero-shot TTS systems to augment personalized data. Subsequently, PSE systems are asked to be trained with this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.
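The abstract describes a two-stage pipeline: first synthesize personalized speech with a zero-shot TTS model from a small amount of target-speaker data, then train a PSE system on the augmented set. The minimal sketch below illustrates that data-flow only; `zero_shot_tts` and `make_pse_training_pairs` are hypothetical names, and the TTS call is a noise-emitting stub standing in for a participant's actual model, not part of the challenge baseline.

```python
import numpy as np

def zero_shot_tts(enrollment: np.ndarray, text: str, sr: int = 16000) -> np.ndarray:
    """Hypothetical stand-in for a zero-shot TTS model. In the challenge,
    participants would invoke their own system here, conditioned on the
    enrollment clip. This stub returns deterministic noise of a fixed
    length so the pipeline runs end to end."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return 0.1 * rng.standard_normal(sr * 2)  # 2 s of placeholder audio

def make_pse_training_pairs(enrollment, transcripts, noise, sr=16000):
    """Build (noisy, clean) pairs for personalized speech enhancement:
    synthesize clean target-speaker speech from one enrollment clip,
    then mix in interference to form the noisy input the PSE model
    would learn to denoise."""
    pairs = []
    for text in transcripts:
        clean = zero_shot_tts(enrollment, text, sr)
        noisy = clean + noise[: len(clean)]
        pairs.append((noisy, clean))
    return pairs

# Toy data: a 3 s "enrollment" clip and 2 s of background interference.
enroll = 0.05 * np.random.default_rng(0).standard_normal(16000 * 3)
noise = 0.02 * np.random.default_rng(1).standard_normal(16000 * 2)
pairs = make_pse_training_pairs(enroll, ["hello world", "test sentence"], noise)
print(len(pairs), pairs[0][0].shape)
```

In the actual challenge, the clean signals would come from participants' TTS systems and the pairs would feed a PSE training loop; this sketch only shows how augmented personalized data plugs into that downstream task.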
Related papers
- Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918] (2024-12-02)
  The 2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. It explores the use of synthetic data, both on its own and combined with real data, to address current challenges in face recognition.
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506] (2024-10-06)
  SONAR provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based deepfake detection systems.
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851] (2024-10-02)
  Synthio is a novel approach for augmenting small-scale audio classification datasets with synthetic data. It aligns the generations of the text-to-audio (T2A) model with the small-scale dataset using preference optimization, and proposes a caption generation technique that leverages the reasoning capabilities of large language models.
- Multi-speaker Text-to-speech Training with Speaker Anonymized Data [40.70515431989197] (2024-05-20)
  This work investigates training multi-speaker text-to-speech (TTS) models on data that underwent speaker anonymization (SA). Two signal-processing-based and three deep-neural-network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, and extensive objective and subjective experiments evaluate both the anonymized training data and the downstream TTS models trained on it.
- An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation [3.6893151241749966] (2024-02-26)
  This paper introduces an end-to-end tool for generating high-quality datasets for text-to-speech (TTS) models. Its contributions include integrating language-specific phoneme distributions into sample selection, with the aim of streamlining the dataset creation process for TTS models.
- Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset [10.119929769316565] (2023-10-08)
  This thesis addresses the need for TTS models that require less training time and fewer data samples while still yielding high-quality voice output. It evaluates the transfer-learning capabilities of state-of-the-art TTS models through a thorough technical analysis, then compares their performance experimentally on a constrained dataset.
- ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT [2.320417845168326] (2023-04-27)
  This work investigates using data obtained by prompting a large generative language model, ChatGPT, to generate synthetic training data for low-resource scenarios. With appropriate task-specific ChatGPT prompts, it outperforms the most popular existing approaches for such data augmentation.
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645] (2023-03-08)
  This work investigates the potential of OpenAI's ChatGPT to aid clinical text mining, proposing a training paradigm that generates a large quantity of high-quality synthetic data and yields significant improvements on downstream tasks.
- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss [49.62291237343537] (2020-10-22)
  This work proposes a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. Using a one-hour open-source singing voice database, it explores the impact of the PE loss on various mainstream sequence-to-sequence models.
This list is automatically generated from the titles and abstracts of the papers in this site.