PodAgent: A Comprehensive Framework for Podcast Generation
- URL: http://arxiv.org/abs/2503.00455v1
- Date: Sat, 01 Mar 2025 11:35:17 GMT
- Title: PodAgent: A Comprehensive Framework for Podcast Generation
- Authors: Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee
- Abstract summary: PodAgent is a framework for creating podcast-like audio programs. It generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system. It builds a voice pool for suitable voice-role matching and utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech.
- Score: 27.525007982804425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation and appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness: it significantly surpasses direct GPT-4 generation in topic-discussion dialogue content, achieves 87.4% voice-matching accuracy, and produces more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
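The Host-Guest-Writer collaboration described in the abstract can be pictured as alternating role-conditioned LLM calls. The sketch below is a minimal illustration of that pattern, not the authors' implementation: `call_llm` is a placeholder stub, and all role prompts and function names are hypothetical.

```python
# Hypothetical sketch of a Host-Guest-Writer multi-agent loop for
# podcast script generation. `call_llm` is a stub; a real system
# would query GPT-4 or another LLM here.

def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"[response to: {prompt[:40]}...]"

def generate_script(topic: str, n_turns: int = 3) -> list[dict]:
    """Alternate host questions and guest answers, then have a
    writer agent polish each exchange into dialogue lines."""
    script = []
    for _ in range(n_turns):
        question = call_llm(f"As the host, ask a probing question about: {topic}")
        answer = call_llm(f"As the guest expert, answer: {question}")
        polished = call_llm(f"As the writer, rewrite this answer conversationally: {answer}")
        script.append({"host": question, "guest": polished})
    return script

script = generate_script("the future of podcast generation")
```

Each finished turn would then be routed through voice-role matching and LLM-guided synthesis, per the pipeline the abstract describes.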
Related papers
- Fun-Audio-Chat Technical Report [71.07966678560291]
The mismatch in temporal resolution between speech tokens (25 Hz) and text tokens (3 Hz) causes semantic misalignment and incurs high computational costs. We introduce Fun-Audio-Chat, a large spoken-dialogue language model. Fun-Audio-Chat 8B and MoE 30BA3B achieve competitive performance on speech-text tasks.
arXiv Detail & Related papers (2025-12-23T08:35:27Z) - DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition [10.94195981338177]
DialogGraph-LLM is an end-to-end framework for recognizing speaker intent in audio dialogues. It combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models for direct acoustic-to-intent inference. The framework demonstrates strong performance and efficiency in intent recognition on real-world audio dialogues.
arXiv Detail & Related papers (2025-11-14T06:42:04Z) - PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation [32.72155456223403]
We take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework, we construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval.
arXiv Detail & Related papers (2025-10-01T04:08:08Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching [79.0241611035794]
CoVoMix2 is a framework for zero-shot multi-talker dialogue generation. It predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed.
arXiv Detail & Related papers (2025-06-01T07:51:45Z) - MoonCast: High-Quality Zero-Shot Podcast Generation [81.29927724674602]
MoonCast is a solution for high-quality zero-shot podcast generation.
It aims to synthesize natural podcast-style speech from text-only sources.
Experiments demonstrate that MoonCast outperforms baselines.
arXiv Detail & Related papers (2025-03-18T15:25:08Z) - Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction [9.101978573666546]
Baichuan-Audio is an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
arXiv Detail & Related papers (2025-02-24T15:16:34Z) - Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing, and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling [62.25533750469467]
We propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified.
We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier, and Scrubs.
We envision this system being useful for the automatic generation of subtitles to improve the accessibility of videos available on modern streaming services.
arXiv Detail & Related papers (2024-01-22T15:26:01Z) - LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z) - VoiceLDM: Text-to-Speech with Environmental Context [22.29992463094861]
VoiceLDM is a model designed to produce audio that accurately follows two distinct natural language text prompts.
By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions.
We show that VoiceLDM is capable of generating plausible audio that aligns well with both input conditions, even surpassing the speech intelligibility of the ground truth audio on the AudioCaps test set.
arXiv Detail & Related papers (2023-09-24T15:20:59Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text-domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
arXiv Detail & Related papers (2020-09-22T04:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.