Natural language guidance of high-fidelity text-to-speech with synthetic annotations
- URL: http://arxiv.org/abs/2402.01912v1
- Date: Fri, 2 Feb 2024 21:29:34 GMT
- Title: Natural language guidance of high-fidelity text-to-speech with synthetic annotations
- Authors: Dan Lyth, Simon King
- Abstract summary: We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k-hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
- Score: 13.642358232817342
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text-to-speech models trained on large-scale datasets have demonstrated
impressive in-context learning capabilities and naturalness. However, control
of speaker identity and style in these models typically requires conditioning
on reference speech recordings, limiting creative applications. Alternatively,
natural language prompting of speaker identity and style has demonstrated
promising results and provides an intuitive method of control. However,
reliance on human-labeled descriptions prevents scaling to large datasets.
Our work bridges the gap between these two approaches. We propose a scalable
method for labeling various aspects of speaker identity, style, and recording
conditions. We then apply this method to a 45k-hour dataset, which we use to
train a speech language model. Furthermore, we propose simple methods for
increasing audio fidelity, significantly outperforming recent work despite
relying entirely on found data.
Our results demonstrate high-fidelity speech generation in a diverse range of
accents, prosodic styles, channel conditions, and acoustic conditions, all
accomplished with a single model and intuitive natural language conditioning.
Audio samples can be heard at https://text-description-to-speech.com/.
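As an editor's illustration of what such scalable labeling can look like in practice, here is a minimal Python sketch: measure signal-level attributes of a clip, bin them into descriptive keywords, and compose a natural-language prompt. The bin edges, keyword choices, and the assumption that speaker gender arrives from a separate classifier are illustrative, not the paper's actual recipe.

```python
import numpy as np
import librosa  # assumed dependency for pitch and energy analysis

# Hypothetical bin edges and keywords -- illustrative, not the paper's values.
PITCH_BINS = [(0, 120, "low-pitched"), (120, 200, "moderate-pitched"), (200, 1e9, "high-pitched")]
SNR_BINS = [(-1e9, 15, "noisy"), (15, 30, "slightly noisy"), (30, 1e9, "very clear")]

def bucket(value, bins):
    """Map a continuous measurement to a descriptive keyword."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label

def describe(wav, sr, speaker_gender="female"):
    """Compose a natural-language description from signal-level measurements.

    speaker_gender is assumed to come from a separate classifier (out of scope here).
    """
    f0 = librosa.yin(wav, fmin=60, fmax=400, sr=sr)
    pitch_label = bucket(np.median(f0), PITCH_BINS)
    # Crude SNR proxy: ratio of loud-frame to quiet-frame energy, in dB.
    rms = librosa.feature.rms(y=wav)[0]
    snr_db = 20 * np.log10(np.percentile(rms, 95) / (np.percentile(rms, 5) + 1e-8))
    noise_label = bucket(snr_db, SNR_BINS)
    return f"A {pitch_label} {speaker_gender} speaker in a {noise_label} recording."
```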
Related papers
- Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach [14.5696754689252]
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible.
We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations.
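A minimal sketch of that fine-tuning setup, assuming a HuggingFace wav2vec 2.0 backbone with a framewise linear phoneme head; the checkpoint name and the 40-label phoneme inventory are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class PhonemeClassifier(nn.Module):
    """Framewise phoneme classifier on top of a pretrained speech encoder."""
    def __init__(self, num_phonemes=40):  # label-set size is an assumption
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_phonemes)

    def forward(self, waveform):                            # (batch, samples)
        hidden = self.encoder(waveform).last_hidden_state   # (batch, frames, dim)
        return self.head(hidden)                            # framewise phoneme logits

model = PhonemeClassifier()
logits = model(torch.randn(1, 16000))   # one second of 16 kHz audio
loss = nn.functional.cross_entropy(
    logits.transpose(1, 2),                              # (batch, classes, frames)
    torch.zeros(1, logits.size(1), dtype=torch.long),    # dummy frame labels
)
loss.backward()  # fine-tunes both the head and the encoder
```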
arXiv Detail & Related papers (2024-09-16T10:29:15Z)
- SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description [19.064845530513285]
We propose an automatic speech annotation system that annotates in-the-wild speech clips with expressive and vivid human language descriptions.
Our system provides in-depth understandings of speech style through tailored natural language descriptions.
The resulting dataset is distinguished by its highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and over two million speech clips.
arXiv Detail & Related papers (2024-08-24T15:36:08Z)
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple yet effective representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
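The core idea reads as uniform quantization of a log-mel spectrogram; here is a minimal sketch, where the mel configuration, the per-utterance dynamic range, and the 16-bin codebook size are assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa

def dmel_encode(wav, sr=16000, n_mels=80, n_bins=16):
    """Discretize each log-mel channel into uniform intensity bins."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)
    lo, hi = log_mel.min(), log_mel.max()   # per-utterance range (an assumption)
    codes = np.floor((log_mel - lo) / (hi - lo + 1e-8) * n_bins)
    return np.clip(codes, 0, n_bins - 1).astype(np.int64)   # (n_mels, frames)

tokens = dmel_encode(np.random.randn(16000).astype(np.float32))
print(tokens.shape, tokens.min(), tokens.max())  # (80, frames), values in [0, 15]
```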
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
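The multi-stage prompting idea can be pictured as template filling followed by an LLM rewrite; the attribute names and prompt wording below are invented for illustration, and `call_llm` is a hypothetical stand-in for whatever GPT client is used.

```python
# Stage 1: render structured style attributes into a skeleton sentence.
def render_attributes(attrs):
    return (f"a {attrs['gender']} speaker with {attrs['pitch']} pitch, "
            f"{attrs['speed']} speaking rate, and a {attrs['emotion']} tone")

# Stage 2: ask an LLM to paraphrase the skeleton into varied natural prose.
REWRITE_PROMPT = (
    "Rewrite the following voice description as one fluent sentence, "
    "varying the wording each time:\n{skeleton}"
)

def describe_style(attrs, call_llm):
    """call_llm: hypothetical function(text) -> str wrapping a GPT endpoint."""
    skeleton = render_attributes(attrs)
    return call_llm(REWRITE_PROMPT.format(skeleton=skeleton))

# Example usage with a trivial stand-in for the LLM call:
print(describe_style(
    {"gender": "female", "pitch": "high", "speed": "slow", "emotion": "cheerful"},
    call_llm=lambda text: text.splitlines()[-1],  # placeholder echo of the skeleton
))
```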
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual, stylized full-body co-speech gestures using only raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
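The fusion can be pictured as a single decoder operating over a joint vocabulary of text tokens and discrete audio tokens; a minimal sketch with assumed vocabulary sizes, not AudioPaLM's actual configuration.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # assumed text tokenizer size
AUDIO_VOCAB = 1_024   # assumed number of discrete audio codes
D_MODEL = 512

# One embedding table covers both modalities: ids [0, TEXT_VOCAB) are text,
# ids [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB) are audio tokens.
embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)

def audio_to_joint(audio_codes: torch.Tensor) -> torch.Tensor:
    """Shift audio codes into the shared id space."""
    return audio_codes + TEXT_VOCAB

text_ids = torch.randint(0, TEXT_VOCAB, (1, 5))
audio_ids = audio_to_joint(torch.randint(0, AUDIO_VOCAB, (1, 7)))
sequence = torch.cat([text_ids, audio_ids], dim=1)  # mixed-modality sequence
hidden = embed(sequence)  # ready for a single decoder-only transformer
```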
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
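That decomposition can be pictured as two parallel encoders whose outputs are fused before decoding; a toy sketch where the layer types, sizes, and fusion-by-addition are all assumptions, not GenerSpeech's architecture.

```python
import torch
import torch.nn as nn

class StyleDecomposedTTS(nn.Module):
    """Toy two-branch encoder: style-agnostic content + style-specific residual."""
    def __init__(self, d_in=80, d_model=256):
        super().__init__()
        self.content_encoder = nn.GRU(d_in, d_model, batch_first=True)  # style-agnostic path
        self.style_encoder = nn.GRU(d_in, d_model, batch_first=True)    # style-specific path
        self.decoder = nn.Linear(d_model, d_in)

    def forward(self, mel, style_ref):
        content, _ = self.content_encoder(mel)     # framewise linguistic content
        _, style = self.style_encoder(style_ref)   # global style vector from a reference clip
        return self.decoder(content + style[-1].unsqueeze(1))  # fuse and decode

model = StyleDecomposedTTS()
out = model(torch.randn(1, 100, 80), torch.randn(1, 60, 80))
print(out.shape)  # torch.Size([1, 100, 80])
```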
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
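A minimal sketch of that framework: a wav2vec 2.0 encoder projects audio into the frozen LM's embedding space, and the audio embeddings are prepended to the text prompt. The checkpoint names, the GPT-2 choice, and the single linear projection are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2LMHeadModel, GPT2Tokenizer

audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # the part that is fine-tuned
lm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
for p in lm.parameters():
    p.requires_grad = False  # the language model stays frozen

project = nn.Linear(audio_enc.config.hidden_size, lm.config.n_embd)

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio
audio_embeds = project(audio_enc(waveform).last_hidden_state)  # (1, frames, n_embd)

prompt = "Question: what did the speaker ask for? Answer:"
prompt_ids = tok(prompt, return_tensors="pt").input_ids
text_embeds = lm.transformer.wte(prompt_ids)

# Prepend audio embeddings to the prompt and let the frozen LM continue.
inputs = torch.cat([audio_embeds, text_embeds], dim=1)
logits = lm(inputs_embeds=inputs).logits
```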
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
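A minimal sketch of such a unit-level LM, assuming an integer-coded inventory of 50 units; the auxiliary text LMs, multitask objectives, and articulatory features mentioned above are omitted.

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    """Next-unit prediction over discrete linguistic units (phonemes/syllables)."""
    def __init__(self, n_units=50, d_embed=64, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_embed)
        self.lstm = nn.LSTM(d_embed, d_hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_hidden, n_units)

    def forward(self, units):              # (batch, time) integer unit ids
        h, _ = self.lstm(self.embed(units))
        return self.out(h)                 # logits over the next unit

model = UnitLM()
seq = torch.randint(0, 50, (8, 20))        # batch of phoneme sequences
logits = model(seq[:, :-1])                # predict each next unit
loss = nn.functional.cross_entropy(logits.reshape(-1, 50), seq[:, 1:].reshape(-1))
```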
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.