An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis
- URL: http://arxiv.org/abs/2403.12402v1
- Date: Tue, 19 Mar 2024 03:22:28 GMT
- Title: An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis
- Authors: Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong,
- Abstract summary: Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning.
We study how the synthesized audio is controlled by the prompt and content.
- Score: 45.558316325252335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content's semantics but mimics the prompt's style. However, there is no systematic understanding on how the synthesized audio is controlled by the prompt and content. In this work, we conduct an empirical study of the widely used autoregressive (AR) and non-autoregressive (NAR) speech LMs and provide insights into the prompt design and content semantic units. Our analysis reveals that heterogeneous and nonstationary prompts hurt the audio quality in contrast to the previous finding that longer prompts always lead to better synthesis. Moreover, we find that the speaker style of the synthesized audio is also affected by the content in addition to the prompt. We further show that semantic units carry rich acoustic information such as pitch, tempo, volume and speech emphasis, which might be leaked from the content to the synthesized audio.
Related papers
- SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought [12.54786997634534]
This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST.
We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting.
The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units.
arXiv Detail & Related papers (2024-05-30T18:28:31Z) - On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - How Generative Spoken Language Modeling Encodes Noisy Speech:
Investigation from Phonetics to Syntactics [33.070158866023]
generative spoken language modeling (GSLM) involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis.
This paper presents the findings of GSLM's encoding and decoding effectiveness at the spoken-language and speech levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z) - A unified one-shot prosody and speaker conversion system with
self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z) - AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.