Topic Model Robustness to Automatic Speech Recognition Errors in Podcast
Transcripts
- URL: http://arxiv.org/abs/2109.12306v1
- Date: Sat, 25 Sep 2021 07:59:31 GMT
- Title: Topic Model Robustness to Automatic Speech Recognition Errors in Podcast
Transcripts
- Authors: Raluca Alexandra Fetic, Mikkel Jordahn, Lucas Chaves Lima, Rasmus Arpe
Fogh Egebæk, Martin Carsten Nielsen, Benjamin Biering, Lars Kai Hansen
- Abstract summary: In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine.
First, we observe a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators.
We then observe how the cosine similarities decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic embeddings from the transcriptions.
- Score: 4.526933031343007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For a multilingual podcast streaming service, it is critical to be able to
deliver relevant content to all users independent of language. Podcast content
relevance is conventionally determined using various metadata sources. However,
with the increasing quality of speech recognition in many languages, utilizing
automatic transcriptions to provide better content recommendations becomes
possible. In this work, we explore the robustness of a Latent Dirichlet
Allocation topic model when applied to transcripts created by an automatic
speech recognition engine. Specifically, we explore how increasing
transcription noise influences topics obtained from transcriptions in Danish, a
low-resource language. First, we observe a baseline of cosine similarity scores
between topic embeddings from automatic transcriptions and the descriptions of
the podcasts written by the podcast creators. We then observe how the cosine
similarities decrease as transcription noise increases and conclude that even
when automatic speech recognition transcripts are erroneous, it is still
possible to obtain high-quality topic embeddings from the transcriptions.
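The abstract describes a pipeline of fitting an LDA topic model, inferring topic embeddings for automatic transcriptions and for creator-written descriptions, and comparing them with cosine similarity as transcription noise grows. Below is a minimal sketch of that idea, assuming a scikit-learn LDA implementation; the toy corpus, hyperparameters, and the crude word-substitution noise function are illustrative assumptions only, not the authors' data or their ASR-derived noise.
```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy training corpus standing in for a collection of podcast transcripts.
corpus = [
    "football match goal league season",
    "election parliament vote policy debate",
    "recipe cooking oven ingredients dinner",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)


def topic_embedding(text: str):
    """Infer the per-document topic distribution used here as a 'topic embedding'."""
    return lda.transform(vectorizer.transform([text]))


def add_noise(text: str, wer: float, seed: int = 0) -> str:
    """Crudely simulate ASR errors by substituting a fraction `wer` of words.

    A stand-in only: the paper derives noise from an actual speech
    recognition engine rather than random word substitution.
    """
    rng = random.Random(seed)
    vocab = vectorizer.get_feature_names_out().tolist()
    words = text.split()
    n_errors = int(len(words) * wer)
    for i in rng.sample(range(len(words)), n_errors):
        words[i] = rng.choice(vocab)
    return " ".join(words)


transcript = "football season goal match league highlights"
description = "weekly show about football and the league season"

baseline = cosine_similarity(topic_embedding(transcript),
                             topic_embedding(description))[0, 0]
noisy = cosine_similarity(topic_embedding(add_noise(transcript, wer=0.3)),
                          topic_embedding(description))[0, 0]
print(f"baseline similarity: {baseline:.3f}, noisy similarity: {noisy:.3f}")
```
Repeating the comparison at increasing word error rates gives a curve of similarity versus noise, which is the kind of degradation the paper tracks to judge robustness.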
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions [0.5120567378386615]
We fine-tune the model to produce more verbatim speech transcriptions.
We employ several techniques to increase robustness against multiple speakers and background noise.
arXiv Detail & Related papers (2024-08-29T14:52:42Z) - Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z) - Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling [62.25533750469467]
We propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified.
We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs.
We envision this system being useful for the automatic generation of subtitles to improve the accessibility of videos available on modern streaming services.
arXiv Detail & Related papers (2024-01-22T15:26:01Z) - Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author.
In this paper, we explore the attribution of transcribed speech, which poses novel challenges.
We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - Identifying Introductions in Podcast Episodes from Automatically
Generated Transcripts [0.0]
We build a novel dataset of complete transcriptions of over 400 podcast episodes.
The episodes' introductions contain information about their topics, hosts, and guests.
We train three Transformer models based on the pre-trained BERT and different augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Improving Accent Conversion with Reference Encoder and End-To-End
Text-To-Speech [23.30022534796909]
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
We propose approaches to improving accent conversion applicability, as well as quality.
arXiv Detail & Related papers (2020-05-19T08:09:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.