Comparative Analysis of Transfer Learning in Deep Learning
Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset
- URL: http://arxiv.org/abs/2310.04982v1
- Date: Sun, 8 Oct 2023 03:08:25 GMT
- Title: Comparative Analysis of Transfer Learning in Deep Learning
Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset
- Authors: Ze Liu
- Abstract summary: This thesis is rooted in the pressing need for TTS models that require less training time and fewer data samples while still yielding high-quality voice output.
The research evaluates the transfer learning capabilities of state-of-the-art TTS models through a thorough technical analysis.
It then conducts a hands-on experimental analysis to compare the models' performance on a constrained dataset.
- Score: 10.119929769316565
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-Speech (TTS) synthesis using deep learning relies on voice quality.
Modern TTS models are advanced, but they need large amounts of data. Given the
growing computational complexity of these models and the scarcity of large,
high-quality datasets, this research focuses on transfer learning, especially
on few-shot, low-resource, and customized datasets. In this research,
"low-resource" specifically refers to situations where there are limited
amounts of training data, such as a small number of audio recordings and
corresponding transcriptions for a particular language or dialect. This thesis
is rooted in the pressing need for TTS models that require less training time
and fewer data samples while still yielding high-quality voice output. The
research evaluates the transfer learning capabilities of state-of-the-art TTS
models through a thorough technical analysis. It then conducts a hands-on
experimental analysis to compare the models' performance on a constrained
dataset. This study
investigates the efficacy of modern TTS systems with transfer learning on
specialized datasets and a model that balances training efficiency and
synthesis quality. Initial hypotheses suggest that transfer learning could
significantly improve TTS models' performance on compact datasets, and an
optimal model may exist for such unique conditions. This thesis predicts a rise
in transfer learning in TTS as data scarcity increases. In the future, custom
TTS applications will favour models optimized for specific datasets over
generic, data-intensive ones.
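The few-shot transfer-learning setup the thesis studies can be sketched in miniature: freeze a pretrained backbone and adapt only a small head on the scarce target data. Everything below is an illustrative toy (synthetic arrays and a linear "backbone"), not a real TTS architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained backbone: a frozen linear map.
# In a real TTS model this would be the bulk of the network's weights.
W_frozen = rng.normal(size=(16, 8)) / 4.0

# Small trainable head, adapted to the new speaker/dialect from scratch.
w_head = np.zeros(8)

# Tiny "few-shot" dataset: 20 samples of 16-dim inputs with scalar targets.
X = rng.normal(size=(20, 16))
v_true = rng.normal(size=8)          # hidden mapping used to synthesize targets
y = (X @ W_frozen) @ v_true

# Fine-tune only the head with plain gradient descent on mean squared error;
# W_frozen is never updated, mirroring a frozen-backbone transfer setup.
lr = 0.1
feats = X @ W_frozen                 # features from the frozen backbone
for _ in range(2000):
    preds = feats @ w_head
    grad = 2.0 * feats.T @ (preds - y) / len(X)
    w_head -= lr * grad

mse = float(np.mean((feats @ w_head - y) ** 2))
```

Freezing the backbone keeps the number of trainable parameters tiny, which is what makes training on a handful of samples feasible without severe overfitting.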
Related papers
- Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model [13.45344843458971]
Keyword spotting (KWS) models require large amounts of training data to be accurate.
TTS models can generate large amounts of natural-sounding data, which can help reduce the cost and time of KWS model development.
We explore various strategies to mix TTS data and real human speech data, with a focus on minimizing real data use and maximizing diversity of TTS output.
arXiv Detail & Related papers (2024-07-26T17:24:50Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
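The merging idea can be sketched as plain weight-space interpolation between two checkpoints with identical architectures; the layer names and values below are toy placeholders, and published recipes (e.g. task arithmetic or TIES-style merging) add refinements beyond linear averaging.

```python
import numpy as np

def merge_models(params_a, params_b, alpha=0.5):
    """Linearly interpolate two checkpoints with identical layer names/shapes.

    A minimal sketch of weight-space merging without any additional training.
    """
    return {name: alpha * params_a[name] + (1 - alpha) * params_b[name]
            for name in params_a}

# Toy checkpoints: same layer names and shapes, different values.
base = {"embed": np.ones((4, 2)), "out": np.zeros(2)}
adapted = {"embed": np.zeros((4, 2)), "out": np.ones(2)}

merged = merge_models(base, adapted, alpha=0.25)
# merged["embed"] entries are 0.25, merged["out"] entries are 0.75
```

Because merging happens purely in parameter space, it sidesteps the need for target-language training data entirely, which is what makes it attractive when data is extremely scarce.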
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z)
- Impact of Dataset on Acoustic Models for Automatic Speech Recognition [0.0]
In Automatic Speech Recognition, GMM-HMM models have been widely used for acoustic modelling.
The GMM models are widely used to create the alignments of the training data for the hybrid deep neural network model.
This work aims to investigate the impact of dataset size variations on the performance of various GMM-HMM Acoustic Models.
arXiv Detail & Related papers (2022-03-25T11:41:49Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs introduced by sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
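The finding above that TTS models are "highly prunable" usually rests on magnitude pruning, which the sketch below illustrates on a toy weight matrix; the function name and threshold rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the entries.

    Ties at the threshold may zero slightly more than the requested fraction.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.array([[0.1, -2.0], [0.05, 3.0]])
pruned = magnitude_prune(w, 0.5)
# The two smallest-magnitude entries (0.1 and 0.05) are zeroed; -2.0 and 3.0 survive.
```

The surviving large-magnitude weights carry most of the model's function, which is why aggressive sparsity can leave naturalness and intelligibility largely intact.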
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.