Pheme: Efficient and Conversational Speech Generation
- URL: http://arxiv.org/abs/2401.02839v1
- Date: Fri, 5 Jan 2024 14:47:20 GMT
- Title: Pheme: Efficient and Conversational Speech Generation
- Authors: Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić
- Abstract summary: We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
- Score: 52.34331755341856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, speech generation has seen remarkable progress, now
achieving one-shot generation capability that is often virtually
indistinguishable from real human voice. Integrating such advancements in
speech generation with large language models might revolutionize a wide range
of applications. However, certain applications, such as assistive
conversational systems, require natural and conversational speech generation
tools that also operate efficiently in real time. Current state-of-the-art
models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs,
require large neural components and extensive training data to work well. In
contrast, MQTTS aims to build more compact conversational TTS models while
capitalizing on smaller-scale real-life conversational speech data. However,
its autoregressive nature yields high inference latency and thus limits its
real-time usage. In order to mitigate the current limitations of the
state-of-the-art TTS models while capitalizing on their strengths, in this work
we introduce the Pheme model series that 1) offers compact yet high-performing
models, 2) allows for parallel speech generation of 3) natural conversational
speech, and 4) can be trained efficiently on smaller-scale conversational
data, cutting data demands by more than 10x but still matching the quality of
the autoregressive TTS models. We also show that through simple teacher-student
distillation we can achieve significant improvements in voice quality for
single-speaker setups on top of pretrained Pheme checkpoints, relying solely on
synthetic speech generated by much larger teacher models. Audio samples and
pretrained models are available online.
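The teacher-student distillation step mentioned above can be pictured with a short sketch. This is a minimal, hypothetical illustration in PyTorch-style Python, not the released Pheme code: the `teacher.synthesize` and `student.training_loss` interfaces are placeholders standing in for a much larger teacher TTS model and a pretrained compact Pheme checkpoint.

```python
# Hypothetical sketch of single-speaker teacher-student distillation as
# described in the abstract: the student is fine-tuned solely on speech
# synthesized by a much larger teacher model. The model interfaces
# (synthesize, training_loss) are placeholders, not the released Pheme API.
import torch


def build_synthetic_corpus(teacher, texts, speaker_prompt):
    """Use the larger teacher model to synthesize speech for each input text."""
    pairs = []
    for text in texts:
        with torch.no_grad():
            audio = teacher.synthesize(text, prompt=speaker_prompt)  # placeholder call
        pairs.append((text, audio))
    return pairs


def distill(student, teacher, texts, speaker_prompt, epochs=10, lr=1e-5):
    """Fine-tune a pretrained compact student checkpoint on synthetic speech only."""
    corpus = build_synthetic_corpus(teacher, texts, speaker_prompt)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for text, audio in corpus:
            loss = student.training_loss(text, audio)  # placeholder training objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

The point of the recipe is that no additional real recordings of the target speaker are required: beyond the pretrained checkpoint and a speaker prompt, the student only ever sees teacher-generated audio.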
Related papers
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [0.0]
Mini-Omni is an audio-based end-to-end conversational model capable of real-time speech interaction.
We propose a text-instructed speech generation method, along with batch-parallel strategies during inference to boost the performance.
We also introduce the VoiceAssistant-400K dataset to fine-tune models for optimized speech output.
arXiv Detail & Related papers (2024-08-29T17:18:53Z) - Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech (a toy sketch of codec-token decoding follows the related-papers list below).
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech
Synthesis [6.632254395574993]
GANSpeech is a high-fidelity TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS backbone.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of a target speaker even from a single reference utterance.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z) - Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based
on Transfer Learning [0.802904964931021]
The proposed approach aims to overcome these limitations by building a system able to model a multi-speaker acoustic space.
This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.
arXiv Detail & Related papers (2021-02-10T18:43:56Z) - Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
First, we augment data via voice conversion, leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model.
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
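As a rough illustration of the latency argument made in the abstract, the toy sketch below contrasts autoregressive decoding over neural codec tokens (the VALL-E/MQTTS style summarized above) with confidence-based parallel refinement in the spirit of SoundStorm, the kind of non-autoregressive generation the abstract contrasts with MQTTS. Everything here is made up for the example: `dummy_logits` stands in for a trained codec-token predictor, and the vocabulary size, mask id, sequence length, and unmasking schedule are arbitrary.

```python
# Toy comparison (not any released model's API): token-by-token decoding needs
# one model call per codec token, while iterative parallel refinement fills all
# positions in a small, fixed number of passes.
import torch

VOCAB, LENGTH, MASK_ID = 1024, 256, 1024  # toy codec vocabulary, sequence length, mask token id


def autoregressive_decode(logits_fn, length=LENGTH):
    """One model call per codec token: latency grows linearly with length."""
    tokens = []
    for _ in range(length):
        logits = logits_fn(torch.tensor(tokens + [MASK_ID]))
        tokens.append(int(logits[-1].argmax()))
    return tokens


def parallel_decode(logits_fn, steps=8, length=LENGTH):
    """MaskGIT/SoundStorm-style refinement: a fixed number of passes, independent of length."""
    tokens = torch.full((length,), MASK_ID)
    for step in range(steps):
        logits = logits_fn(tokens)                   # one pass over all positions
        probs, preds = logits.softmax(-1).max(-1)
        probs[tokens != MASK_ID] = -1.0              # never revisit already committed tokens
        target = length * (step + 1) // steps        # total tokens committed after this step
        to_commit = target - int((tokens != MASK_ID).sum())
        top = probs.topk(max(to_commit, 0)).indices  # keep the most confident predictions
        tokens[top] = preds[top]
    return tokens.tolist()


def dummy_logits(tokens):
    """Stand-in for a trained codec-token predictor, only to make the sketch runnable."""
    return torch.randn(len(tokens), VOCAB)
```

With the defaults above, `autoregressive_decode(dummy_logits)` calls the predictor 256 times while `parallel_decode(dummy_logits)` calls it 8 times; that difference in the number of sequential model calls is the real-time gap the Pheme abstract targets.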