MoonCast: High-Quality Zero-Shot Podcast Generation
- URL: http://arxiv.org/abs/2503.14345v2
- Date: Wed, 19 Mar 2025 07:17:41 GMT
- Title: MoonCast: High-Quality Zero-Shot Podcast Generation
- Authors: Zeqian Ju, Dongchao Yang, Jianwei Yu, Kai Shen, Yichong Leng, Zhengtao Wang, Xu Tan, Xinyu Zhou, Tao Qin, Xiangyang Li
- Abstract summary: MoonCast is a solution for high-quality zero-shot podcast generation. It aims to synthesize natural podcast-style speech from text-only sources. Experiments demonstrate that MoonCast outperforms baselines.
- Score: 81.29927724674602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
Related papers
- LoRP-TTS: Low-Rank Personalized Text-To-Speech [0.0]
Speech synthesis models convert written text into natural-sounding audio.
Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts.
arXiv Detail & Related papers (2025-02-11T14:00:12Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates.
The work also introduces new embedding-based and LLM-judged metrics, quality measurements over length and time, and a new benchmark for long-form speech processing and generation, LibriSpeech-Long.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [97.75037148056367]
CoVoMix is a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
We devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation.
arXiv Detail & Related papers (2024-04-10T02:32:58Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- A Two-Phase Approach for Abstractive Podcast Summarization [18.35061145103997]
Podcast summarization is different from summarization of other data formats.
We propose a two-phase approach: sentence selection and seq2seq learning.
Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.
arXiv Detail & Related papers (2020-11-16T21:31:28Z)
- A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z)