The Impact of Automatic Speech Transcription on Speaker Attribution
- URL: http://arxiv.org/abs/2507.08660v1
- Date: Fri, 11 Jul 2025 15:00:32 GMT
- Title: The Impact of Automatic Speech Transcription on Speaker Attribution
- Authors: Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews
- Abstract summary: This paper is the first comprehensive study of the impact of automatic speech recognition on speaker attribution performance. We find that attribution is surprisingly resilient to word-level transcription errors. Our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data.
- Score: 7.588876479279435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.
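The abstract above refers to word-level transcription errors. One common way to quantify them is word error rate (WER): the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. The sketch below is illustrative only. The function name `word_error_rate` and the whitespace tokenization are assumptions made here, and the paper's own evaluation setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing "the cat sat on the mat" against "the cat sat on a mat" counts one substitution over six reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.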
Related papers
- ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification [48.98768967435808]
We use a computational method to verify if an utterance matches the identity of an enrolled speaker. Despite much success, we have yet to develop a speaker verification system that offers explainable results. A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait.
arXiv Detail & Related papers (2025-01-10T05:53:37Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Quantification of stylistic differences in human- and ASR-produced transcripts of African American English [1.8021379035665333]
Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation.
We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English speech.
We investigate the interactions of these categories with how well transcripts can be compared via word error rate.
arXiv Detail & Related papers (2024-09-04T20:18:59Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
- Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes [3.198144010381572]
This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes.
It is demonstrated that the behaviour of the ASR is systematic and consistent across speakers with similar spoken varieties.
arXiv Detail & Related papers (2023-05-12T11:29:13Z)
- Improving Self-Supervised Speech Representations by Disentangling Speakers [56.486084431528695]
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus.
Disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well.
We propose a new SSL method that can achieve speaker disentanglement without severe loss of content.
arXiv Detail & Related papers (2022-04-20T04:56:14Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)