HierSpeech++: Bridging the Gap between Semantic and Acoustic
Representation of Speech by Hierarchical Variational Inference for Zero-shot
Speech Synthesis
- URL: http://arxiv.org/abs/2311.12454v2
- Date: Mon, 27 Nov 2023 12:26:32 GMT
- Authors: Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee
- Abstract summary: Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
- Score: 39.892633589217326
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language model (LLM)-based speech synthesis has been widely adopted in
zero-shot speech synthesis. However, these models require large-scale data and
suffer from the same limitations as previous autoregressive speech models,
including slow inference speed and a lack of robustness. This paper proposes
HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech
(TTS) and voice conversion (VC). We verified that hierarchical speech synthesis
frameworks could significantly improve the robustness and expressiveness of the
synthetic speech. Furthermore, we significantly improve the naturalness and
speaker similarity of synthetic speech even in zero-shot speech synthesis
scenarios. For text-to-speech, we adopt the text-to-vec framework, which
generates a self-supervised speech representation and an F0 representation
based on text representations and prosody prompts. Then, HierSpeech++ generates
speech from the generated vector, F0, and voice prompt. We further introduce a
highly efficient speech super-resolution framework that upsamples speech from 16 kHz to 48 kHz. The
experimental results demonstrated that the hierarchical variational autoencoder
could be a strong zero-shot speech synthesizer given that it outperforms
LLM-based and diffusion-based models. Moreover, we achieved human-level quality in
zero-shot speech synthesis for the first time. Audio samples and source code
are available at https://github.com/sh-lee-prml/HierSpeechpp.
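
The abstract describes a three-stage inference pipeline: a text-to-vec model predicts self-supervised units and F0 from text and a prosody prompt, a hierarchical synthesizer renders a 16 kHz waveform from those units conditioned on a voice prompt, and a super-resolution module upsamples to 48 kHz. Below is a minimal sketch of that data flow; all module internals, names, and shapes are illustrative assumptions, not the authors' implementation (see the linked repository for the real code).

```python
import torch
import torch.nn as nn

class TextToVec(nn.Module):
    """Stage 1: text representation + prosody prompt -> SSL units and F0."""
    def __init__(self, text_dim=256, unit_dim=256):
        super().__init__()
        self.encoder = nn.Linear(text_dim, unit_dim)  # placeholder for the real text encoder
        self.f0_head = nn.Linear(unit_dim, 1)         # frame-level F0 predictor

    def forward(self, text_repr, prosody_prompt):
        # Prosody conditioning via a pooled prompt embedding (placeholder).
        h = self.encoder(text_repr) + prosody_prompt.mean(dim=1, keepdim=True)
        return h, self.f0_head(h).squeeze(-1)

class HierarchicalSynthesizer(nn.Module):
    """Stage 2: units + F0 + voice prompt -> 16 kHz waveform (a hierarchical VAE in the paper)."""
    def __init__(self, unit_dim=256, hop=320):
        super().__init__()
        self.decoder = nn.Linear(unit_dim, hop)       # placeholder upsampling decoder

    def forward(self, units, f0, voice_prompt):
        # Speaker conditioning via a pooled voice-prompt embedding;
        # F0 conditioning is omitted in this stub.
        h = units + voice_prompt.mean(dim=1, keepdim=True)
        return torch.tanh(self.decoder(h).flatten(1)) # (batch, frames * hop) samples

class SpeechSR(nn.Module):
    """Stage 3: 16 kHz -> 48 kHz super-resolution (naive 3x upsampling placeholder)."""
    def forward(self, wav16k):
        return torch.repeat_interleave(wav16k, 3, dim=-1)

# Zero-shot inference: prompts are encodings of a short reference from an unseen speaker.
text_repr      = torch.randn(1, 50, 256)  # 50 text-aligned frames
prosody_prompt = torch.randn(1, 20, 256)
voice_prompt   = torch.randn(1, 20, 256)

ttv, synth, sr = TextToVec(), HierarchicalSynthesizer(), SpeechSR()
units, f0 = ttv(text_repr, prosody_prompt)
wav48k = sr(synth(units, f0, voice_prompt))
print(wav48k.shape)  # torch.Size([1, 48000]) = 50 frames * 320 hop * 3
```

In the actual system each placeholder is a full generative module, but the staged interfaces above are the part the abstract describes: prosody, content, and speaker identity enter the pipeline at different levels of the hierarchy.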
Related papers
- SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech.
SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arriving text token in one step.
arXiv Detail & Related papers (2025-02-16T12:14:17Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
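
Finite-scalar quantization, mentioned above, replaces a learned codebook lookup with per-dimension rounding onto a small fixed grid, so every implicit codebook entry is reachable and utilization is high by construction. A minimal sketch of the core operation follows; it is illustrative only, not CosyVoice 2's tokenizer (odd level counts keep the grid symmetric around zero, while even counts would need a half-step offset omitted here).

```python
import torch

def fsq(z, levels=(7, 5, 5, 5)):
    """Quantize z of shape (..., len(levels)) with a straight-through gradient."""
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    bounded = torch.tanh(z) * half        # squash each dim into (-half, half)
    quantized = torch.round(bounded)      # snap to the nearest integer level
    # Straight-through estimator: forward pass uses the rounded values,
    # backward pass treats the rounding as identity.
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 10, 4)                 # (batch, time, latent dims)
tokens = fsq(z)
print(tokens.shape)                       # torch.Size([2, 10, 4]); implicit codebook 7*5*5*5 = 875
```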
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
- FlashSpeech: Efficient Zero-Shot Speech Synthesis [37.883762387219676]
FlashSpeech is a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work.
We show that FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity.
arXiv Detail & Related papers (2024-04-23T02:57:46Z)
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
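
A memory-cached recurrence of the kind described above is commonly realized Transformer-XL style: hidden states of previously encoded sentences are detached and cached, and the current sentence attends over the cache as extra keys and values, giving paragraph-level context at fixed cost. A minimal sketch under that assumption (not ContextSpeech's actual code):

```python
import torch
import torch.nn as nn

class CachedSentenceEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, mem_len=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_len = mem_len
        self.memory = None  # hidden states cached from earlier sentences

    def forward(self, x):
        # Keys/values span the cached paragraph context plus the current sentence.
        ctx = x if self.memory is None else torch.cat([self.memory, x], dim=1)
        out, _ = self.attn(x, ctx, ctx)
        # Update the cache, detached so gradients never flow into past sentences.
        self.memory = ctx[:, -self.mem_len:].detach()
        return out

enc = CachedSentenceEncoder()
paragraph = [torch.randn(1, 40, 256) for _ in range(3)]  # three consecutive sentences
for sent in paragraph:
    out = enc(sent)   # each sentence sees up to mem_len frames of prior context
print(out.shape)      # torch.Size([1, 40, 256])
```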
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
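
Residual vector quantization, the mechanism behind the codec's quantized latents, stacks several small codebooks: each stage quantizes the residual left by the previous one, so coarse stages capture the bulk of the signal and later stages add detail. A minimal sketch with random, untrained codebooks purely for illustration (not the NaturalSpeech 2 codec):

```python
import torch

def rvq(z, codebooks):
    """z: (n, d); codebooks: list of (k, d) tensors -> (quantized z, stage indices)."""
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest codeword per vector
        chosen = cb[idx]
        quantized = quantized + chosen                 # accumulate coarse-to-fine
        residual = residual - chosen                   # next stage quantizes what is left
        indices.append(idx)
    return quantized, indices

torch.manual_seed(0)
z = torch.randn(16, 8)                                 # 16 latent vectors of dim 8
codebooks = [torch.randn(32, 8) for _ in range(4)]     # 4 stages, 32 codewords each
q, ids = rvq(z, codebooks)
print(q.shape, [i.shape for i in ids])                 # each vector becomes 4 small token ids
```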
arXiv Detail & Related papers (2023-04-18T16:31:59Z)