Assessing the Impact of Speaker Identity in Speech Spoofing Detection
- URL: http://arxiv.org/abs/2602.20805v1
- Date: Tue, 24 Feb 2026 11:45:41 GMT
- Title: Assessing the Impact of Speaker Identity in Speech Spoofing Detection
- Authors: Anh-Tuan Dao, Driss Matrouf, Nicholas Evans
- Abstract summary: Spoofing detection systems are typically trained using diverse recordings from multiple speakers. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task framework.
- Score: 1.7816843507516946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task (SInMT) framework, one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).
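The gradient reversal layer mentioned in the abstract acts as an identity function in the forward pass but negates (and optionally scales) gradients during backpropagation, so the speaker-classification head trains normally while the shared encoder is pushed toward speaker-invariant embeddings. A minimal NumPy sketch of this mechanism (class and parameter names are illustrative, not from the paper):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in backprop.

    Conceptually placed between a shared encoder and a speaker-classification
    head: the head still learns to recognise speakers, but the reversed
    gradient discourages the encoder from encoding speaker identity.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (often annealed during training)

    def forward(self, x):
        # Embeddings pass through unchanged.
        return x

    def backward(self, grad_output):
        # Gradient reaching the encoder is negated and scaled by lambda.
        return -self.lam * grad_output


grl = GradientReversal(lam=0.5)
emb = np.array([0.2, -1.3, 0.7])
out = grl.forward(emb)                       # identical to emb
grad_to_encoder = grl.backward(np.ones(3))   # negated, scaled gradient
```

In practice this is implemented as a custom autograd function inside the training framework; the sketch only isolates the forward/backward behaviour that makes the multi-task objective adversarial with respect to speaker identity.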
Related papers
- Evaluating Identity Leakage in Speaker De-Identification Systems [1.7699344561127388]
Speaker de-identification aims to conceal a speaker's identity while preserving intelligibility of the underlying speech. We introduce a benchmark that quantifies residual identity leakage with three complementary error rates. Evaluation results reveal that all state-of-the-art speaker de-identification systems leak identity information.
arXiv Detail & Related papers (2025-08-19T17:20:25Z) - Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings [52.985061676464554]
We propose a Knowledge Distillation based training approach for short-context speaker embedding extraction. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. Results demonstrate that our models are effective at short-context embedding extraction and more robust to overlap.
arXiv Detail & Related papers (2025-08-18T11:32:13Z) - Investigating Confidence Estimation Measures for Speaker Diarization [4.679826697518427]
Speaker diarization systems segment a conversation recording based on the speakers' identity.
Speaker diarization errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity.
One way to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems.
arXiv Detail & Related papers (2024-06-24T20:21:38Z) - Symmetric Saliency-based Adversarial Attack To Speaker Identification [17.087523686496958]
We propose a novel generation-network-based approach, called the symmetric saliency-based encoder-decoder (SSED).
First, it uses a novel saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system.
Second, it proposes an angular loss function to push the speaker embedding far away from the source speaker.
arXiv Detail & Related papers (2022-10-30T08:54:02Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - Disentangled dimensionality reduction for noise-robust speaker diarisation [30.383712356205084]
Speaker embeddings play a crucial role in the performance of diarisation systems.
Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance.
We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings.
We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
arXiv Detail & Related papers (2021-10-07T12:19:09Z) - FoolHD: Fooling speaker identification by Highly imperceptible adversarial Disturbances [63.80959552818541]
We propose a white-box steganography-inspired adversarial attack that generates imperceptible perturbations against a speaker identification model.
Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function.
We validate FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility.
arXiv Detail & Related papers (2020-11-17T07:38:26Z) - Continuous Speech Separation with Conformer [60.938212082732775]
We use Transformer and Conformer architectures in lieu of recurrent neural networks in the separation system.
We believe capturing global information with self-attention is crucial for speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z) - Integrated Replay Spoofing-aware Text-independent Speaker Verification [47.41124427552161]
We propose two approaches for building an integrated system of speaker verification and presentation attack detection.
The first approach simultaneously trains speaker identification, presentation attack detection, and the integrated system using multi-task learning.
The second is a back-end modular approach using a separate deep neural network (DNN) for speaker verification and presentation attack detection.
arXiv Detail & Related papers (2020-06-10T01:24:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.