Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language
- URL: http://arxiv.org/abs/2408.10128v2
- Date: Fri, 23 Aug 2024 16:15:30 GMT
- Title: Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language
- Authors: Manjil Karki, Pratik Shakya, Sandesh Acharya, Ravi Pandit, Dinesh Gothe
- Abstract summary: A neural voice cloning system can mimic someone's voice using just a few audio samples.
Speaker encoding and speaker adaptation are topics of research in the field of voice cloning.
The main goal is to create a voice cloning system that produces audio output with a Nepali accent.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice cloning is a prominent feature in personalized speech interfaces. A neural voice cloning system can mimic someone's voice using just a few audio samples. Both speaker encoding and speaker adaptation are topics of research in the field of voice cloning. Speaker adaptation relies on fine-tuning a multi-speaker generative model, while speaker encoding involves training a separate model to infer a new speaker embedding that conditions the multi-speaker generative model. Both methods can achieve excellent performance, even with a small number of cloning samples, in terms of the naturalness of the speech and its similarity to the original speaker. Speaker encoding approaches are more appropriate for low-resource deployment, since they require significantly less memory and clone faster than speaker adaptation, which in turn can offer slightly greater naturalness and similarity. The main goal is to create a voice cloning system that produces audio output with a Nepali accent, or that sounds Nepali. To further advance TTS, transfer learning was used effectively to address several issues encountered in the development of this system, including poor audio quality and the lack of available data.
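The trade-off above comes down to what happens at cloning time: speaker encoding runs a single forward pass through a trained encoder, while speaker adaptation takes gradient steps on the new speaker's samples. A minimal PyTorch sketch of the contrast follows; MultiSpeakerTTS and SpeakerEncoder are toy stand-ins, not the paper's actual models.

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # hypothetical speaker-embedding size

class MultiSpeakerTTS(nn.Module):
    """Toy stand-in for a multi-speaker generative model:
    maps (text features, speaker embedding) -> mel frames."""
    def __init__(self, text_dim=32, mel_dim=80):
        super().__init__()
        self.net = nn.Linear(text_dim + EMB_DIM, mel_dim)

    def forward(self, text_feats, spk_emb):
        return self.net(torch.cat([text_feats, spk_emb], dim=-1))

class SpeakerEncoder(nn.Module):
    """Toy stand-in for a speaker encoder: reference mel -> embedding."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(mel_dim, EMB_DIM)

    def forward(self, ref_mel):                # ref_mel: (frames, mel_dim)
        return self.proj(ref_mel).mean(dim=0)  # average over time

tts, encoder = MultiSpeakerTTS(), SpeakerEncoder()
text_feats = torch.randn(10, 32)   # 10 "phoneme" frames
ref_mel = torch.randn(200, 80)     # a few seconds of cloning audio
target_mel = torch.randn(10, 80)   # ground-truth mel for those frames

# Speaker encoding: a single forward pass, no gradient steps at cloning time.
with torch.no_grad():
    emb = encoder(ref_mel)
    mel_fast = tts(text_feats, emb.expand(10, -1))

# Speaker adaptation: fine-tune the generator (and a speaker embedding)
# on the new speaker's samples -- slower and heavier, often more similar.
emb_adapt = nn.Parameter(torch.zeros(EMB_DIM))
opt = torch.optim.Adam(list(tts.parameters()) + [emb_adapt], lr=1e-3)
for _ in range(50):  # a few gradient steps on the cloning samples
    opt.zero_grad()
    pred = tts(text_feats, emb_adapt.expand(10, -1))
    nn.functional.mse_loss(pred, target_mel).backward()
    opt.step()
```

The adaptation loop updates the generator itself for every new speaker, which is why it costs more memory and cloning time than the single encoder pass.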
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
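A minimal sketch of a conditional real/fake discriminator over speech features in the spirit described above; the encoder-only layout, conditioning scheme, and dimensions here are simplifying assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MelDiscriminator(nn.Module):
    """Classifies mel-spectrogram sequences as real or generated,
    conditioned on aligned text features (illustrative layout only)."""
    def __init__(self, mel_dim=80, txt_dim=80, d_model=128):
        super().__init__()
        self.mel_in = nn.Linear(mel_dim, d_model)
        self.txt_in = nn.Linear(txt_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)   # real/fake logit per frame

    def forward(self, mel, txt):
        h = self.encoder(self.mel_in(mel) + self.txt_in(txt))
        return self.head(h).mean(dim=1)     # one logit per utterance

disc = MelDiscriminator()
real, fake = torch.randn(2, 120, 80), torch.randn(2, 120, 80)
txt = torch.randn(2, 120, 80)   # upsampled text features, same length
bce = nn.BCEWithLogitsLoss()
d_loss = bce(disc(real, txt), torch.ones(2, 1)) + \
         bce(disc(fake, txt), torch.zeros(2, 1))
g_loss = bce(disc(fake, txt), torch.ones(2, 1))  # generator tries to fool it
```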
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
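The discrete-unit idea reduces speech to a sequence of cluster indices that can be translated like text, while a separate timbre stream preserves the speaker's voice. The sketch below shows only the unit-extraction step, with a random codebook and random features standing in for trained k-means centroids and real self-supervised (HuBERT-style) features.

```python
import torch

def to_units(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous frame features to discrete unit ids by
    nearest-neighbor lookup in a k-means codebook."""
    dists = torch.cdist(feats, codebook)   # (frames, K)
    return dists.argmin(dim=-1)            # (frames,) unit ids

K, D = 100, 256                  # codebook size, feature dim
codebook = torch.randn(K, D)     # stands in for trained k-means centroids
ssl_feats = torch.randn(300, D)  # stands in for SSL features of an utterance
units = to_units(ssl_feats, codebook)
# 'units' can now be fed to a translation model; a separate stream of
# timbre units would condition the unit-to-speech stage on speaker voice.
```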
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
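For context, residual vector quantization encodes a latent vector as a sum of codewords from successive codebooks, each stage quantizing the residual left by the previous one. A minimal sketch with random codebooks:

```python
import torch

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stages.  Returns code indices and the
    reconstructed (quantized) vector."""
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:                            # cb: (K, D)
        idx = torch.cdist(residual, cb).argmin(dim=-1)
        chosen = cb[idx]
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return codes, quantized

D = 16
codebooks = [torch.randn(64, D) for _ in range(4)]  # 4 stages, 64 codes each
x = torch.randn(8, D)                               # 8 latent vectors
codes, xq = rvq_encode(x, codebooks)
print((x - xq).pow(2).mean())  # error shrinks as stages are added
```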
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
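A toy sketch of the joint masking step described above; the mask ratios and the reserved [MASK] id are illustrative choices, not values from the paper.

```python
import torch

MASK_ID = 0  # hypothetical phoneme id reserved for [MASK]

def mask_inputs(mel, phonemes, mel_ratio=0.15, ph_ratio=0.15):
    """Randomly mask spectrogram frames and phoneme tokens so the model
    must reconstruct each modality from the other's context."""
    mel, phonemes = mel.clone(), phonemes.clone()
    mel_mask = torch.rand(mel.shape[0]) < mel_ratio
    ph_mask = torch.rand(phonemes.shape[0]) < ph_ratio
    mel[mel_mask] = 0.0            # zero out masked frames
    phonemes[ph_mask] = MASK_ID    # replace masked phonemes with [MASK]
    return mel, phonemes, mel_mask, ph_mask

mel = torch.randn(200, 80)               # spectrogram frames
phonemes = torch.randint(1, 50, (40,))   # phoneme ids (1..49)
mel_m, ph_m, mel_mask, ph_mask = mask_inputs(mel, phonemes)
# A reconstruction loss is then applied only at the masked positions.
```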
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning [37.73490851004852]
The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring the speaking style of an arbitrary source speaker to a target speaker's voice using a very limited amount of neutral data.
This is a very challenging task since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time.
In this paper, we approach the challenging task of fast few-shot style transfer for voice cloning using meta learning.
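A minimal sketch of one common episodic meta-learning recipe (a MAML-style inner/outer loop) that fits the description above; the linear model and random data are toy stand-ins, and the paper's exact algorithm may differ.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 80)   # toy stand-in for an adaptable TTS component
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

for episode in range(100):  # each episode plays the role of one "speaker"
    support_x, support_y = torch.randn(4, 16), torch.randn(4, 80)
    query_x, query_y = torch.randn(4, 16), torch.randn(4, 80)

    # Inner loop: one gradient step on the speaker's few support samples.
    w = dict(model.named_parameters())
    loss = nn.functional.mse_loss(
        support_x @ w['weight'].T + w['bias'], support_y)
    grads = torch.autograd.grad(loss, list(w.values()), create_graph=True)
    w = {n: p - inner_lr * g for (n, p), g in zip(w.items(), grads)}

    # Outer loop: evaluate adapted weights on query data, update the init
    # so that a single adaptation step works well for unseen speakers.
    q_loss = nn.functional.mse_loss(
        query_x @ w['weight'].T + w['bias'], query_y)
    meta_opt.zero_grad()
    q_loss.backward()
    meta_opt.step()
```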
arXiv Detail & Related papers (2021-11-14T01:30:37Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single short reference audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
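SALN predicts the normalization gain and bias from a style vector, so one set of weights can render many voices. A minimal sketch of the layer (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Layer norm whose scale/shift are predicted from a style vector,
    instead of being fixed learned parameters."""
    def __init__(self, hidden_dim, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # -> gamma, beta

    def forward(self, h, style):             # h: (B, T, H), style: (B, S)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(h) + beta.unsqueeze(1)

saln = SALN(hidden_dim=128, style_dim=64)
h = torch.randn(2, 50, 128)   # hidden states of a TTS block
style = torch.randn(2, 64)    # style vector from a reference encoder
out = saln(h, style)          # same shape, re-styled per speaker
```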
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
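A sketch of the integration idea: a frozen, pretrained speaker embedding (e.g., a d-vector from a speaker-verification model) combined with a learnable per-speaker lookup embedding before conditioning the acoustic model. The layout below is an assumption for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CombinedSpeakerEmbedding(nn.Module):
    """Sums a projected, frozen pretrained embedding with a learnable
    per-speaker embedding; either branch can carry a new speaker."""
    def __init__(self, num_speakers, pretrained_dim=192, emb_dim=64):
        super().__init__()
        self.proj = nn.Linear(pretrained_dim, emb_dim)    # match dims
        self.table = nn.Embedding(num_speakers, emb_dim)  # learnable ids

    def forward(self, pretrained_emb, speaker_id):
        return self.proj(pretrained_emb) + self.table(speaker_id)

combiner = CombinedSpeakerEmbedding(num_speakers=10)
dvector = torch.randn(2, 192)     # frozen d-vector-style embedding
spk_id = torch.tensor([3, 7])
cond = combiner(dvector, spk_id)  # (2, 64) conditioning vector
# For an unseen few-shot speaker, the pretrained branch supplies a prior
# while the learnable branch is fine-tuned on the few available samples.
```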
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning [0.802904964931021]
The proposed approach aims to overcome these limitations by obtaining a system able to model a multi-speaker acoustic space.
This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.
arXiv Detail & Related papers (2021-02-10T18:43:56Z)
- Expressive Neural Voice Cloning [12.010555227327743]
We propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker.
We show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker.
arXiv Detail & Related papers (2021-01-30T05:09:57Z)
- Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [44.700803634034486]
Cross-lingual speech generation is the scenario in which speech utterances are generated in the voices of target speakers, in a language not originally spoken by them.
We show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps.
arXiv Detail & Related papers (2020-10-08T01:25:07Z)