Topic Modeling on Podcast Short-Text Metadata
- URL: http://arxiv.org/abs/2201.04419v1
- Date: Wed, 12 Jan 2022 11:07:05 GMT
- Title: Topic Modeling on Podcast Short-Text Metadata
- Authors: Francisco B. Valero and Marion Baranes and Elena V. Epure
- Abstract summary: We assess the feasibility of discovering relevant topics from podcast metadata, titles and descriptions, using topic modeling techniques for short text.
We propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization topic modeling framework.
Our experiments on two existing datasets, from Spotify and iTunes, and on Deezer, a new dataset, show that our proposed document representation, NEiCE, leads to improved topic coherence over the baselines.
- Score: 0.9539495585692009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Podcasts have emerged as massively consumed online content, notably due to
wider accessibility of production means and scaled distribution through large
streaming platforms. Categorization systems and information access technologies
typically use topics as the primary way to organize or navigate podcast
collections. However, annotating podcasts with topics is still quite
problematic because the assigned editorial genres are broad, heterogeneous or
misleading, or because of data challenges (e.g. short metadata text, noisy
transcripts). Here, we assess the feasibility of discovering relevant topics from
podcast metadata, titles and descriptions, using topic modeling techniques for
short text. We also propose a new strategy to leverage named entities (NEs),
often present in podcast metadata, in a Non-negative Matrix Factorization (NMF)
topic modeling framework. Our experiments on two existing datasets, from Spotify
and iTunes, and on Deezer, a new dataset from an online service providing a catalog
of podcasts, show that our proposed document representation, NEiCE, leads to
improved topic coherence over the baselines. We release the code for
experimental reproducibility of the results.
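As a rough illustration of the kind of pipeline the abstract describes, the sketch below fits an NMF topic model on short podcast metadata (title plus description) with scikit-learn, and crudely emphasizes named entities by repeating their mentions before TF-IDF weighting. This is only a stand-in for the paper's NEiCE representation, which is more involved; the toy corpus, entity list, repeat factor, and number of topics are assumptions for illustration, not details from the paper.

```python
# Minimal sketch: NMF topic modeling on short podcast metadata, with a crude
# stand-in for named-entity (NE) emphasis. Not the paper's NEiCE method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy "title + description" metadata for a handful of podcast episodes (made up).
docs = [
    "NBA Weekly: trade rumors, playoff race and LeBron James highlights",
    "True crime stories from small-town America, interviews with detectives",
    "Startup founders on fundraising, venture capital and product-market fit",
    "Mindfulness and meditation practices for better sleep and less stress",
    "College basketball recap: March Madness brackets and coach interviews",
    "Angel investing explained: term sheets, equity and startup valuations",
]

# Hypothetical NE list; a real pipeline would obtain these from an NE recognizer.
named_entities = {"lebron james", "march madness", "nba"}

def upweight_entities(text, entities, repeat=2):
    """Duplicate NE mentions so TF-IDF gives them more mass (crude NE emphasis)."""
    lowered = text.lower()
    extra = [ne for ne in entities if ne in lowered]
    return text + (" " + " ".join(extra)) * repeat if extra else text

boosted_docs = [upweight_entities(d, named_entities) for d in docs]

# Short texts: keep unigrams and bigrams, drop English stop words.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(boosted_docs)

# Factorize into K topics; W holds doc-topic weights, H holds topic-term weights.
K = 3
nmf = NMF(n_components=K, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top = topic.argsort()[::-1][:5]
    print(f"Topic {k}: " + ", ".join(terms[i] for i in top))
```

In the paper, topic quality is compared via coherence scores over the baselines; in a sketch like this, a tool such as gensim's CoherenceModel could play that evaluation role.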
Related papers
- TopicGPT: A Prompt-based Topic Modeling Framework [77.72072691307811]
We introduce TopicGPT, a prompt-based framework that uses large language models to uncover latent topics in a text collection.
It produces topics that align better with human categorizations compared to competing methods.
Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
arXiv Detail & Related papers (2023-11-02T17:57:10Z) - Topic Taxonomy Expansion via Hierarchy-Aware Topic Phrase Generation [58.3921103230647]
We propose a novel framework for topic taxonomy expansion, named TopicExpan.
TopicExpan directly generates topic-related terms belonging to new topics.
Experimental results on two real-world text corpora show that TopicExpan significantly outperforms other baseline methods in terms of the quality of output.
arXiv Detail & Related papers (2022-10-18T22:38:49Z) - Identifying Introductions in Podcast Episodes from Automatically
Generated Transcripts [0.0]
We build a novel dataset of complete transcriptions of over 400 podcast episodes, in which episode introductions are identified.
These introductions contain information about the episodes' topics, hosts, and guests.
We train three Transformer models based on pre-trained BERT with different data augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z) - ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive
Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets covering diverse online conversation forms: news comments, discussion forums, community question answering forums, and email threads.
We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z) - LRG at TREC 2020: Document Ranking with XLNet-Based Models [0.9023847175654602]
Given a user's query and an accompanying description, the task is to find the most relevant short segment from a dataset containing all the podcasts.
Previous approaches that rely solely on classical Information Retrieval (IR) techniques perform poorly when such descriptive queries are presented.
We experiment with two hybrid models that first shortlist the most relevant podcast segments for the user's query with a classical IR technique, and then re-rank the shortlisted documents using the detailed description (a minimal sketch of this retrieve-and-re-rank pattern appears after the related-papers list).
arXiv Detail & Related papers (2021-02-28T03:04:29Z) - Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and
Context-Aware Auto-Encoders [59.038157066874255]
We propose a novel framework called RankAE to perform chat summarization without employing manually labeled data.
RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously.
A denoising auto-encoder is designed to generate succinct but context-informative summaries based on the selected utterances.
arXiv Detail & Related papers (2020-12-14T07:31:17Z) - A Two-Phase Approach for Abstractive Podcast Summarization [18.35061145103997]
Podcast summarization is different from the summarization of other data formats.
We propose a two-phase approach: sentence selection and seq2seq learning.
Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.
arXiv Detail & Related papers (2020-11-16T21:31:28Z) - PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
arXiv Detail & Related papers (2020-09-22T04:49:33Z) - A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding structure mutually bring benefit to the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z)