Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition
- URL: http://arxiv.org/abs/2110.11479v1
- Date: Thu, 21 Oct 2021 21:11:42 GMT
- Title: Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition
- Authors: Ting-Yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Koppula, Oncel Tuzel
- Abstract summary: Machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions.
We propose two novel techniques during training to mitigate the problems due to the distribution gap.
We show that these methods significantly improve the training of speech recognition models using synthetic data.
- Score: 18.924716098922683
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With recent advances in speech synthesis, synthetic data is becoming a viable
alternative to real data for training speech recognition models. However,
machine learning with synthetic data is not trivial due to the gap between the
synthetic and the real data distributions. Synthetic datasets may contain
artifacts that do not exist in real data such as structured noise, content
errors, or unrealistic speaking styles. Moreover, the synthesis process may
introduce a bias due to uneven sampling of the data manifold. We propose two
novel techniques during training to mitigate the problems due to the
distribution gap: (i) a rejection sampling algorithm and (ii) using separate
batch normalization statistics for the real and the synthetic samples. We show
that these methods significantly improve the training of speech recognition
models using synthetic data. We evaluate the proposed approach on keyword
detection and Automatic Speech Recognition (ASR) tasks, and observe up to 18%
and 13% relative error reduction, respectively, compared to naively using the
synthetic data.
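For intuition, below is a minimal PyTorch sketch of the two techniques the abstract names. It is an illustrative reading of the abstract, not the authors' implementation: the names (`DualBatchNorm1d`, `rejection_filter`) and the use of discriminator-style "realness" scores as the acceptance probability are assumptions.

```python
# Illustrative sketch only; names and the realness-score acceptance rule are
# assumptions, not the paper's released code.
import torch
import torch.nn as nn


class DualBatchNorm1d(nn.Module):
    """Separate batch-norm statistics for real vs. synthetic batches, so
    synthetic artifacts never contaminate the real-data running statistics."""

    def __init__(self, num_features: int):
        super().__init__()
        self.bn_real = nn.BatchNorm1d(num_features)
        self.bn_synth = nn.BatchNorm1d(num_features)

    def forward(self, x: torch.Tensor, is_synthetic: bool) -> torch.Tensor:
        # Route each batch through the normalizer matching its domain.
        return self.bn_synth(x) if is_synthetic else self.bn_real(x)


def rejection_filter(features: torch.Tensor,
                     realness: torch.Tensor) -> torch.Tensor:
    """Keep each synthetic sample with probability `realness`, e.g. a
    discriminator's estimate in [0, 1] that the sample is real, so the
    least realistic samples are discarded most often."""
    accept = torch.rand_like(realness) < realness
    return features[accept]


# Usage sketch: filter a synthetic batch, then normalize each domain separately.
bn = DualBatchNorm1d(num_features=40)    # e.g. 40 filterbank channels (assumed)
real_batch = torch.randn(32, 40)         # placeholder real features
synth_batch = torch.randn(32, 40)        # placeholder synthetic features
realness = torch.rand(32)                # placeholder discriminator scores

kept_synth = rejection_filter(synth_batch, realness)
h_real = bn(real_batch, is_synthetic=False)
h_synth = bn(kept_synth, is_synthetic=True)
```

In this sketch, rejection sampling thins out the least realistic synthetic samples before they reach the model, while the dual batch-norm keeps real and synthetic feature statistics from mixing.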
Related papers
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Are Synthetic Time-series Data Really not as Good as Real Data? [29.852306720544224]
Time-series data presents limitations stemming from data quality issues, bias and vulnerabilities, and generalization problems.
We introduce InfoBoost -- a highly versatile cross-domain data synthesizing framework with time series representation learning capability.
We have developed a method based on synthetic data that enables model training without any real data, surpassing the performance of models trained on real data.
arXiv Detail & Related papers (2024-02-01T13:59:04Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score.
Our proposed random projection based framework can generate synthetic data with the highest accuracy score and has the fastest scalability.
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms [5.366354612549173]
We focus on data-synthesis methods to create high-quality synthetic data.
We present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data.
Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
arXiv Detail & Related papers (2022-04-08T07:48:57Z)
- Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to learning from synthetic data, founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)