Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
- URL: http://arxiv.org/abs/2411.07892v1
- Date: Tue, 12 Nov 2024 15:56:48 GMT
- Title: Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
- Authors: Benjamin Litterer, David Jurgens, Dallas Card
- Abstract summary: We introduce a massive dataset of over 1.1M podcast transcripts available through public RSS feeds from May and June of 2020.
This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes.
Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this popular and impactful medium.
- Score: 23.70786221902932
- Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
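To give a concrete sense of how such transcript data can be explored, below is a minimal sketch that tallies speaker turns per episode from a gzipped JSONL dump. The file name and record fields (`epTitle`, `turns`, `speaker`) are illustrative assumptions, not the corpus's actual schema.

```python
import gzip
import json
from collections import Counter
from statistics import median

def speaker_turn_stats(path):
    """Tally speaker turns per episode from a gzipped JSONL transcript dump.

    Assumes one episode per line with hypothetical fields "epTitle" and
    "turns" (a list of {"speaker": ..., "text": ...} dicts); the real
    corpus schema may differ.
    """
    turns_per_episode = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            episode = json.loads(line)
            turns_per_episode[episode["epTitle"]] = len(episode.get("turns", []))
    return turns_per_episode

if __name__ == "__main__":
    counts = speaker_turn_stats("podcast_transcripts.jsonl.gz")
    print(f"{len(counts)} episodes")
    print(f"median speaker turns per episode: {median(counts.values())}")
```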
Related papers
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We evaluate the pipeline by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z) - YODAS: Youtube-Oriented Dataset for Audio and Speech [47.60574092241447]
YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
arXiv Detail & Related papers (2024-06-02T23:43:27Z) - WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining this pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - NewsPod: Automatic and Interactive News Podcasts [18.968547560235347]
NewsPod is an automatically generated, interactive news podcast.
The podcast is divided into segments, each centered on a news event, with each segment structured as a Question and Answer conversation.
A novel aspect of NewsPod is that listeners can interact with the podcast by asking their own questions and receiving automatically generated answers.
arXiv Detail & Related papers (2022-02-15T02:37:04Z) - Topic Modeling on Podcast Short-Text Metadata [0.9539495585692009]
We assess the feasibility of discovering relevant topics from podcast metadata (titles and descriptions) using topic modeling techniques for short text.
We propose a new strategy for incorporating named entities (NEs), often present in podcast metadata, into a Non-negative Matrix Factorization (NMF) modeling framework; a minimal NMF sketch appears after this list.
Our experiments on podcast datasets from Spotify, iTunes, and Deezer show that our proposed document representation, NEiCE, leads to improved coherence over the baselines.
arXiv Detail & Related papers (2022-01-12T11:07:05Z) - Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts [0.0]
We build a novel dataset of complete transcriptions of over 400 podcast episodes and focus on identifying their introductions, which contain information about the episodes' topics, hosts, and guests.
We train three Transformer models based on pre-trained BERT with different data augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset; a toy ROUGE computation is sketched after this list.
arXiv Detail & Related papers (2020-09-22T04:49:33Z) - A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z) - Spot the conversation: speaker diarisation in the wild [108.61222789195209]
First, we propose an automatic audio-visual diarisation method for YouTube videos.
Second, we integrate our method into a semi-automatic dataset creation pipeline.
Third, we use this pipeline to create a large-scale diarisation dataset called VoxConverse.
arXiv Detail & Related papers (2020-07-02T15:55:54Z)
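As referenced in the topic-modeling entry above, here is a minimal scikit-learn sketch of NMF over podcast titles and descriptions. It uses plain TF-IDF features and toy metadata strings; it does not implement that paper's NEiCE representation or its named-entity strategy.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy podcast metadata (title + description concatenated); stand-ins, not real data.
docs = [
    "True crime stories investigated week by week",
    "Daily news briefing on politics and world events",
    "Interviews with startup founders about tech and venture capital",
    "Guided meditation and mindfulness for better sleep",
    "College football recaps, rankings, and game predictions",
    "Long-form conversations about machine learning research",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Factor the TF-IDF matrix into document-topic loadings (W) and topic-term weights.
nmf = NMF(n_components=3, random_state=0)
W = nmf.fit_transform(X)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")

# Dominant topic per document.
for doc, loading in zip(docs, W):
    print(f"topic {loading.argmax()} <- {doc[:40]}")
```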
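And for the ROUGE-F scores reported in the PodSumm entry, a toy computation with the `rouge_score` package; the example texts and the choice of scoring package are assumptions, not details from that paper.

```python
from rouge_score import rouge_scorer

# Hypothetical reference and generated summaries, for illustration only.
reference = "The hosts discuss the week's tech news and interview a guest about open-source AI."
candidate = "Hosts cover tech news and interview a guest on open-source AI."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Each result has precision, recall, and F-measure; papers usually report F.
    print(f"{name}: F1 = {result.fmeasure:.2f}")
```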
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.