MoonCast: High-Quality Zero-Shot Podcast Generation
- URL: http://arxiv.org/abs/2503.14345v2
- Date: Wed, 19 Mar 2025 07:17:41 GMT
- Title: MoonCast: High-Quality Zero-Shot Podcast Generation
- Authors: Zeqian Ju, Dongchao Yang, Jianwei Yu, Kai Shen, Yichong Leng, Zhengtao Wang, Xu Tan, Xinyu Zhou, Tao Qin, Xiangyang Li
- Abstract summary: MoonCast is a solution for high-quality zero-shot podcast generation. It aims to synthesize natural podcast-style speech from text-only sources. Experiments demonstrate that MoonCast outperforms baselines.
- Score: 81.29927724674602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
Related papers
- Listening Between the Lines: Decoding Podcast Narratives with Language Modeling [17.51119928424848]
We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture subtle cues that human listeners rely on to identify narrative frames. Our approach then uses these granular frame labels to reveal broader discourse trends.
arXiv Detail & Related papers (2025-11-07T15:12:06Z)
- NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion [25.896735200803537]
NaturalVoices (NV) is the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles.
arXiv Detail & Related papers (2025-10-31T21:00:14Z)
- Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance [47.2016265294791]
Full-Duplex Speech Language Models (FD-SLMs) capture nuanced two-speaker dialogue patterns for human-like interactions. They face a critical challenge: their conversational abilities often degrade compared to pure-text conversation. We propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning.
arXiv Detail & Related papers (2025-08-10T14:49:43Z)
- LoRP-TTS: Low-Rank Personalized Text-To-Speech [0.0]
Speech synthesis models convert written text into natural-sounding audio. Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts.
arXiv Detail & Related papers (2025-02-11T14:00:12Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates. The paper also introduces new embedding-based and LLM-judged metrics, quality measurements over length and time, and LibriSpeech-Long, a new benchmark for long-form speech processing and generation.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines independent voice activity detection and text-to-speech.
We show how Moshi Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is first real-time full spoken large language model modality.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [97.75037148056367]
CoVoMix is a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.<n>We devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation.
arXiv Detail & Related papers (2024-04-10T02:32:58Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio predictor with residual vectorizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - A Two-Phase Approach for Abstractive Podcast Summarization [18.35061145103997]
podcast summarization is different from summarization of other data formats.
We propose a two-phase approach: sentence selection and seq2seq learning.
Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.
arXiv Detail & Related papers (2020-11-16T21:31:28Z) - A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.