Identifying Introductions in Podcast Episodes from Automatically
Generated Transcripts
- URL: http://arxiv.org/abs/2110.07096v1
- Date: Thu, 14 Oct 2021 00:34:51 GMT
- Title: Identifying Introductions in Podcast Episodes from Automatically
Generated Transcripts
- Authors: Elise Jing, Kristiana Schneck, Dennis Egan, Scott A. Waterman
- Abstract summary: We build a novel dataset of complete transcriptions of over 400 podcast episodes.
These introductions contain information about the episodes' topics, hosts, and guests.
We train three Transformer models based on pre-trained BERT with different augmentation strategies.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the volume of long-form spoken-word content such as podcasts explodes,
many platforms desire to present short, meaningful, and logically coherent
segments extracted from the full content. Such segments can be consumed by
users to sample content before diving in, as well as used by the platform to
promote and recommend content. However, little published work focuses on the
segmentation of spoken-word content, where the errors (noise) in transcripts
generated by automatic speech recognition (ASR) services pose many challenges.
Here we build a novel dataset of complete transcriptions of over 400 podcast
episodes, in which we label the position of introductions in each episode.
These introductions contain information about the episodes' topics, hosts, and
guests, providing a valuable summary of the episode content, as it is created
by the authors. We further augment our dataset with word substitutions to
increase the amount of available training data. We train three Transformer
models based on pre-trained BERT with different augmentation strategies,
which achieve significantly better performance compared with a static embedding
model, showing that it is possible to capture generalized, larger-scale
structural information from noisy, loosely-organized speech data. This is
further demonstrated through an analysis of the models' inner architecture. Our
methods and dataset can be used to facilitate future work on the
structure-based segmentation of spoken-word content.
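The abstract describes augmenting the labeled transcripts with word substitutions to enlarge the training set. The paper does not specify the substitution mechanism here, so the following is only a minimal sketch under an assumed setup: a hand-written substitution table standing in for whatever synonym or embedding-neighbour source the authors actually used.

```python
import random

# Hypothetical substitution table for illustration only; the paper's actual
# source of substitute words is not given in this listing.
SUBSTITUTIONS = {
    "podcast": ["show", "program"],
    "episode": ["installment", "session"],
    "guest": ["visitor", "speaker"],
}

def augment(transcript: str, rate: float = 0.3, seed: int = 0) -> str:
    """Return a copy of the transcript in which words found in the
    substitution table are randomly replaced, producing additional
    training examples with the same intro/non-intro labels."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for word in transcript.split():
        subs = SUBSTITUTIONS.get(word.lower())
        if subs and rng.random() < rate:
            out.append(rng.choice(subs))
        else:
            out.append(word)
    return " ".join(out)
```

Each augmented transcript keeps its original position labels, since substitution does not move the introduction boundary; the augmented copies would then be fed to a BERT-based classifier alongside the originals.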
Related papers
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Topic Modeling on Podcast Short-Text Metadata [0.9539495585692009]
We assess the feasibility of discovering relevant topics from podcast metadata (titles and descriptions) using modeling techniques for short text.
We propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization modeling framework.
Our experiments on two existing datasets, from Spotify and from iTunes and Deezer, show that our proposed document representation, NEiCE, leads to improved coherence over the baselines.
arXiv Detail & Related papers (2022-01-12T11:07:05Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Detecting Extraneous Content in Podcasts [6.335863593761816]
We present a model that leverages both textual and listening patterns to detect extraneous content in podcast descriptions and audio transcripts.
We show that our models can substantively improve ROUGE scores and reduce the extraneous content generated in the summaries.
arXiv Detail & Related papers (2021-03-03T18:30:50Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text-domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
arXiv Detail & Related papers (2020-09-22T04:49:33Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.