PodSumm -- Podcast Audio Summarization
- URL: http://arxiv.org/abs/2009.10315v1
- Date: Tue, 22 Sep 2020 04:49:33 GMT
- Title: PodSumm -- Podcast Audio Summarization
- Authors: Aneesh Vartakavi and Amanmeet Garg
- Abstract summary: We propose a method to automatically construct a podcast summary via guidance from the text-domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The diverse nature, scale, and specificity of podcasts present a unique
challenge to content discovery systems. Listeners often rely on text
descriptions of episodes provided by the podcast creators to discover new
content. Some factors like the presentation style of the narrator and
production quality are significant indicators of subjective user preference but
are difficult to quantify and not reflected in the text descriptions provided
by the podcast creators. We propose the automated creation of podcast audio
summaries to aid in content discovery and help listeners to quickly preview
podcast content before investing time in listening to an entire episode. In
this paper, we present a method to automatically construct a podcast summary
via guidance from the text-domain. Our method performs two key steps, namely,
audio to text transcription and text summary generation. Motivated by a lack of
datasets for this task, we curate an internal dataset, find an effective scheme
for data augmentation, and design a protocol to gather summaries from
annotators. We fine-tune a PreSumm[10] model with our augmented dataset and
perform an ablation study. Our method achieves ROUGE-F(1/2/L) scores of
0.63/0.53/0.63 on our dataset. We hope these results may inspire future
research in this direction.
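The ROUGE-F(1/2/L) figures above are F-measures over unigram overlap, bigram overlap, and the longest common subsequence between a generated summary and a reference summary. As a rough illustration only, the sketch below computes the three scores for a single pair of texts. It assumes lowercased whitespace tokenization with no stemming, and the function names (`rouge_n`, `rouge_l`) are hypothetical; the abstract does not say which ROUGE implementation or preprocessing the authors used, so scores from this sketch are not directly comparable to the reported numbers.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F-measure using clipped n-gram overlap.

    Assumption: simple lowercased whitespace tokenization, no stemming.
    """
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n(generated, reference, 1))
    print(rouge_n(generated, reference, 2))
    print(rouge_l(generated, reference))
```

For a dataset-level figure such as the one quoted in the abstract, per-summary scores like these are typically averaged over all evaluation pairs.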
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus [23.70786221902932]
We introduce a massive dataset of over 1.1M podcast transcripts available through public RSS feeds from May and June of 2020.
This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes.
Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this popular and impactful medium.
arXiv Detail & Related papers (2024-11-12T15:56:48Z) - WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Topic Modeling on Podcast Short-Text Metadata [0.9539495585692009]
We assess the feasibility of discovering relevant topics from podcast metadata (titles and descriptions) using modeling techniques for short text.
We propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization modeling framework.
Our experiments on two existing datasets from Spotify and iTunes and Deezer show that our proposed document representation, NEiCE, leads to improved coherence over the baselines.
arXiv Detail & Related papers (2022-01-12T11:07:05Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our Adaptive Mean Margin (AMM) approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - Spotify at TREC 2020: Genre-Aware Abstractive Podcast Summarization [4.456617185465443]
The goal of this challenge was to generate short, informative summaries that contain the key information present in a podcast episode.
We propose two summarization models that explicitly take genre and named entities into consideration.
Our models are abstractive, and supervised using creator-provided descriptions as ground truth summaries.
arXiv Detail & Related papers (2021-04-07T18:27:28Z) - Detecting Extraneous Content in Podcasts [6.335863593761816]
We present a model that leverages both textual and listening patterns to detect extraneous content in podcast descriptions and audio transcripts.
We show that our models can substantively improve ROUGE scores and reduce the extraneous content generated in the summaries.
arXiv Detail & Related papers (2021-03-03T18:30:50Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)