VoiceLDM: Text-to-Speech with Environmental Context
- URL: http://arxiv.org/abs/2309.13664v1
- Date: Sun, 24 Sep 2023 15:20:59 GMT
- Title: VoiceLDM: Text-to-Speech with Environmental Context
- Authors: Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung
- Abstract summary: VoiceLDM is a model designed to produce audio that accurately follows two distinct natural language text prompts.
By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions.
We show that VoiceLDM is capable of generating plausible audio that aligns well with both input conditions, even surpassing the speech intelligibility of the ground truth audio on the AudioCaps test set.
- Score: 22.29992463094861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents VoiceLDM, a model designed to produce audio that
accurately follows two distinct natural language text prompts: the description
prompt and the content prompt. The former provides information about the
overall environmental context of the audio, while the latter conveys the
linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based
on latent diffusion models and extend its functionality to incorporate an
additional content prompt as a conditional input. By utilizing pretrained
contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained
on large amounts of real-world audio without manual annotations or
transcriptions. Additionally, we employ dual classifier-free guidance to
further enhance the controllability of VoiceLDM. Experimental results
demonstrate that VoiceLDM is capable of generating plausible audio that aligns
well with both input conditions, even surpassing the speech intelligibility of
the ground truth audio on the AudioCaps test set. Furthermore, we explore the
text-to-speech (TTS) and zero-shot text-to-audio capabilities of VoiceLDM and
show that it achieves competitive results. Demos and code are available at
https://voiceldm.github.io.
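A minimal sketch of how dual classifier-free guidance can combine two conditions is given below. This is a hedged illustration assuming one guidance weight per prompt; the function and variable names are ours, and the exact combination rule used by VoiceLDM may differ.

```python
import torch

def dual_cfg(eps_both, eps_content_only, eps_desc_only, w_desc, w_cont):
    """Combine three noise predictions from the same denoiser.

    eps_both:         prediction conditioned on both the description and content prompts
    eps_content_only: prediction with the description prompt replaced by a null embedding
    eps_desc_only:    prediction with the content prompt replaced by a null embedding
    w_desc, w_cont:   guidance weights for the description and content conditions
    """
    return (eps_both
            + w_desc * (eps_both - eps_content_only)  # push toward the description prompt
            + w_cont * (eps_both - eps_desc_only))    # push toward the content prompt

# Toy usage with random tensors standing in for latent noise predictions.
if __name__ == "__main__":
    e0, e1, e2 = (torch.randn(1, 8, 16, 16) for _ in range(3))
    print(dual_cfg(e0, e1, e2, w_desc=7.0, w_cont=2.0).shape)
```

Raising one weight while holding the other fixed trades off adherence to the environmental description against adherence to the spoken content.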
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
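As a rough illustration of audio-language model guidance, the sketch below has a pretrained audio-text embedding model (e.g. a CLAP-style encoder) rerank complete candidate captions proposed by an LLM. This is a simplified, hedged stand-in, not ZerAuCap's actual guided decoding, and the encoder callable is a placeholder.

```python
import torch
import torch.nn.functional as F

def rerank_captions(audio_embedding, captions, encode_text):
    """Pick the caption whose text embedding is most similar to the audio embedding.

    audio_embedding: 1-D tensor from an audio-language model's audio encoder
    encode_text:     callable mapping a caption string to a 1-D text embedding
    """
    scores = [F.cosine_similarity(audio_embedding, encode_text(c), dim=0).item()
              for c in captions]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores[best]

# Toy usage with random embeddings standing in for real encoder outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    dummy_text_encoder = lambda caption: torch.randn(512)
    caption, score = rerank_captions(torch.randn(512),
                                     ["birds chirping in a park", "a car engine idling"],
                                     dummy_text_encoder)
    print(caption, round(score, 3))
```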
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
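The entry above names local-global fusion without detail, so the sketch below shows only a generic cross-attention block in which video tokens attend over audio tokens; it is one plausible building block for audio-visual information exchange, not the paper's specific mechanism.

```python
import torch
import torch.nn as nn

class AudioVideoFusion(nn.Module):
    """Generic cross-attention fusion: video tokens attend over audio tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # The residual connection keeps the original video features alongside audio context.
        fused, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        return self.norm(video_tokens + fused)

# Toy usage: a batch of 2 clips with 16 video tokens and 32 audio tokens of width 256.
if __name__ == "__main__":
    out = AudioVideoFusion()(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
    print(out.shape)  # torch.Size([2, 16, 256])
```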
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
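The core idea, conditioning a diffusion model on a pretrained image embedding of a video frame so that no text labels are needed, can be sketched as a single training step. The toy MLP denoiser, tensor shapes, and names below are our own simplifications, not CLIPSonic's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Stand-in for the real denoiser (actual systems use U-Nets over spectrogram latents)."""

    def __init__(self, audio_dim=128, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim + cond_dim + 1, 256),
                                 nn.SiLU(),
                                 nn.Linear(256, audio_dim))

    def forward(self, noisy_audio, t, frame_embedding):
        return self.net(torch.cat([noisy_audio, frame_embedding, t], dim=-1))

def training_step(denoiser, audio_latent, frame_embedding, alphas_cumprod):
    """Noise the audio latent at a random timestep and regress the added noise,
    conditioned on a CLIP-style embedding of a frame from the paired video."""
    b, num_steps = audio_latent.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, num_steps, (b,))
    a = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(audio_latent)
    noisy = a.sqrt() * audio_latent + (1 - a).sqrt() * noise
    pred = denoiser(noisy, t.float().unsqueeze(-1) / num_steps, frame_embedding)
    return F.mse_loss(pred, noise)

# Toy usage with random tensors in place of real audio latents and frame embeddings.
if __name__ == "__main__":
    schedule = torch.linspace(0.999, 0.01, 1000)
    loss = training_step(ToyDenoiser(), torch.randn(4, 128), torch.randn(4, 512), schedule)
    print(loss.item())
```

At inference time the frame embedding can be swapped for a text embedding from the same language-image model, which is what makes unlabeled videos usable for text-to-audio synthesis.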
- AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [35.703877904270726]
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions.
In this study, we propose AudioLDM, a TTA system built on a latent space that learns continuous audio representations from contrastive language-audio pretraining (CLAP) latents.
Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics.
arXiv Detail & Related papers (2023-01-29T17:48:17Z)
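The key design choice described above, building generation around CLAP latents, can be summarized in a small hedged sketch; the encode_audio / encode_text method names are hypothetical placeholders rather than an actual CLAP library API.

```python
def get_condition(clap_model, audio=None, text=None):
    """Return the conditioning embedding for the latent diffusion model.

    Because CLAP embeds audio and text in a shared space, the generator can be
    trained with audio embeddings as conditions (no captions required) and then
    be driven by text embeddings at inference time.
    """
    if audio is not None:                    # training: paired audio is available
        return clap_model.encode_audio(audio)
    return clap_model.encode_text(text)      # inference: only a text description is given
```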
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
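As a rough sketch of the context-aware generation step, the toy module below maps mel frames surrounding the insertion point, together with encoded target text, to predicted mel frames. It is a non-autoregressive stand-in with hypothetical names; the actual system generates the edited region with a transformer-based decoder.

```python
import torch
import torch.nn as nn

class EditRegionDecoder(nn.Module):
    """Toy decoder: predict mel frames for an edited span from surrounding speech
    context (queries) and encoded target text (memory)."""

    def __init__(self, n_mels=80, d_model=256, heads=4, layers=2):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(d_model, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, context_mel, text_features):
        # context_mel:   (B, T_ctx, n_mels) mel frames around the insertion point
        # text_features: (B, T_txt, d_model) encoded target text, including the inserted words
        h = self.decoder(self.in_proj(context_mel), text_features)
        return self.out_proj(h)  # one predicted frame per context frame (toy simplification)

# Toy usage with random tensors in place of real mel frames and text encodings.
if __name__ == "__main__":
    out = EditRegionDecoder()(torch.randn(2, 120, 80), torch.randn(2, 24, 256))
    print(out.shape)  # torch.Size([2, 120, 80])
```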
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.