Related papers: On Barriers to Archival Audio Processing

On Barriers to Archival Audio Processing

URL: http://arxiv.org/abs/2507.08768v1
Date: Fri, 11 Jul 2025 17:27:11 GMT
Title: On Barriers to Archival Audio Processing
Authors: Peter Sullivan, Muhammad Abdul-Mageed,
Abstract summary: We leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods.<n>Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech.<n>However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language.
Score: 16.244692109502726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language. Issues which will need to be overcome should archives aim to employ SR methods for speaker indexing.

Related papers

Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity. Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent. This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora.<n>We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.<n>This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems. We introduce spoken language understanding modules to extract speaker-related semantic information. We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings. Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis [7.6883773606941075]
CrossSpeech improves the quality of cross-lingual speech by effectively disentangling speaker and language information. From the experiments, we verify that CrossSpeech achieves significant improvements in cross-lingual TTS.
arXiv Detail & Related papers (2023-02-28T07:51:10Z)
Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on the Librispeech dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data. It generates clusterable embedding vectors from small fixed-size speech frames. We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z)
A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity. With the rise of deep learning technology, more rapid advancements have been made for speaker diarization. We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge a labeled training data that is close to the test speaker's speech. We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.