PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
- URL: http://arxiv.org/abs/2510.00485v1
- Date: Wed, 01 Oct 2025 04:08:08 GMT
- Title: PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
- Authors: Yujia Xiao, Liumeng Xue, Lei He, Xinyi Chen, Aemon Yat Fei Chiu, Wenjie Tian, Shaofei Zhang, Qiuqiang Kong, Xinfa Zhu, Wei Xue, Tan Lee
- Abstract summary: We take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework, we construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval.
- Score: 32.72155456223403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models' understanding capability. However, exploration into assessing generative capabilities remains limited, especially for open-ended long-form content generation. Significant challenges lie in the absence of a reference standard answer, the lack of unified evaluation metrics, and the variability of human judgments. In this work, we take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework: 1) We construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. 2) We introduce a multimodal evaluation strategy and decompose the complex task into three dimensions: text, speech, and audio, with different evaluation emphasis on "Content" and "Format". 3) For each modality, we design corresponding evaluation methods, involving both objective metrics and subjective listening tests. We leverage representative podcast generation systems (including open-source, closed-source, and human-made) in our experiments. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open-ended long-form audio. This project is open-source to facilitate public use: https://github.com/yujxx/PodEval.
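The three-dimension decomposition described in the abstract (text, speech, audio, each judged on "Content" and "Format") can be sketched as a simple scoring container. The class names, score ranges, and equal-weight aggregation below are illustrative assumptions for exposition, not PodEval's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a PodEval-style result: three modalities
# (text, speech, audio), each scored on "Content" and "Format".
# Scores in [0, 1] and uniform averaging are assumptions.

@dataclass
class DimensionScore:
    content: float  # e.g. topical relevance, coherence of the discussion
    format: float   # e.g. dialogue structure, speaker turns, audio quality

@dataclass
class PodEvalResult:
    text: DimensionScore
    speech: DimensionScore
    audio: DimensionScore

    def overall(self) -> float:
        # Average all six sub-scores with equal weight.
        dims = [self.text, self.speech, self.audio]
        return sum(d.content + d.format for d in dims) / (2 * len(dims))

result = PodEvalResult(
    text=DimensionScore(content=0.82, format=0.75),
    speech=DimensionScore(content=0.70, format=0.68),
    audio=DimensionScore(content=0.66, format=0.71),
)
```

In practice the framework pairs such objective scores with subjective listening tests, so a single aggregate like `overall()` would only summarize the objective side.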
Related papers
- AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech [56.08149157180447]
We introduce AudioCapBench, a benchmark for evaluating the audio captioning capabilities of large multimodal models. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework.
arXiv Detail & Related papers (2026-02-27T03:33:37Z) - SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation [52.468945848774844]
This paper addresses the need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four different dimensions of evaluation.
arXiv Detail & Related papers (2026-01-27T15:29:02Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - Discrete Audio Tokens: More Than a Survey! [137.3721175670642]
This paper presents a systematic review and benchmark of discrete audio tokenizers. It covers speech, music, and general audio domains. We propose a taxonomy of tokenization approaches based on encoder-decoder architecture, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z) - Rhapsody: A Dataset for Highlight Detection in Podcasts [49.576469262265455]
We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlights. We frame podcast highlight detection as a segment-level binary classification task. We explore various baseline language models and lightweight fine-tuned language models.
arXiv Detail & Related papers (2025-05-26T02:39:34Z) - PodAgent: A Comprehensive Framework for Podcast Generation [27.525007982804425]
PodAgent is a framework for creating podcast-like audio programs. It generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system. It builds a voice pool for suitable voice-role matching and utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech.
arXiv Detail & Related papers (2025-03-01T11:35:17Z) - Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus [23.70786221902932]
We introduce a massive dataset of over 1.1M podcast transcripts available through public RSS feeds from May and June of 2020.
This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes.
Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this popular and impactful medium.
arXiv Detail & Related papers (2024-11-12T15:56:48Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Topic Modeling on Podcast Short-Text Metadata [0.9539495585692009]
We assess the feasibility to discover relevant topics from podcast metadata, titles and descriptions, using modeling techniques for short text.
We propose a new strategy for incorporating named entities (NEs), often present in podcast metadata, into a Non-negative Matrix Factorization modeling framework.
Our experiments on two existing datasets, from Spotify and from iTunes and Deezer, show that our proposed document representation, NEiCE, leads to improved coherence over the baselines.
arXiv Detail & Related papers (2022-01-12T11:07:05Z) - A Two-Phase Approach for Abstractive Podcast Summarization [18.35061145103997]
Podcast summarization differs from summarization of other data formats.
We propose a two-phase approach: sentence selection and seq2seq learning.
Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.
arXiv Detail & Related papers (2020-11-16T21:31:28Z) - PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text-domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
arXiv Detail & Related papers (2020-09-22T04:49:33Z)
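Several entries above report reference-based overlap metrics, e.g. ROUGE-L in AudioCapBench and the ROUGE-F(1/2/L) scores in PodSumm. As a reminder of what such a score measures, here is a from-scratch sketch of the ROUGE-L F-measure via longest common subsequence; whitespace tokenization and the balanced F1 variant are simplifying assumptions, and official implementations differ in detail:

```python
# Minimal ROUGE-L sketch: longest common subsequence (LCS) between
# candidate and reference tokens, combined into an F-measure.

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming LCS.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l("a dog barks in the park",
                "the dog barks loudly in the park")
```

The LCS rewards in-order overlap without requiring contiguous n-gram matches, which is why ROUGE-L is a common default for long-form summaries and captions.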
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.