VoxWatch: An open-set speaker recognition benchmark on VoxCeleb
- URL: http://arxiv.org/abs/2307.00169v1
- Date: Fri, 30 Jun 2023 23:11:38 GMT
- Title: VoxWatch: An open-set speaker recognition benchmark on VoxCeleb
- Authors: Raghuveer Peri and Seyed Omid Sadjadi and Daniel Garcia-Romero
- Abstract summary: Open-set speaker identification (OSI) deals with determining if a test speech sample belongs to a speaker from a set of pre-enrolled individuals (in-set) or if it is from an out-of-set speaker.
As the size of the in-set speaker population grows, the out-of-set scores become larger, leading to increased false alarm rates.
We present the first public benchmark for OSI, developed using the VoxCeleb dataset.
- Score: 10.84962993456577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its broad practical applications such as in fraud prevention,
open-set speaker identification (OSI) has received less attention in the
speaker recognition community compared to speaker verification (SV). OSI deals
with determining if a test speech sample belongs to a speaker from a set of
pre-enrolled individuals (in-set) or if it is from an out-of-set speaker. In
addition to the typical challenges associated with speech variability, OSI is
prone to the "false-alarm problem": as the size of the in-set speaker
population (a.k.a. watchlist) grows, the out-of-set scores become larger,
leading to increased false alarm rates. This is particularly challenging for
applications in financial institutions and border security where the watchlist
size is typically of the order of several thousand speakers. Therefore, it is
important to systematically quantify the false-alarm problem, and develop
techniques that alleviate the impact of watchlist size on detection
performance. Prior studies on this problem are sparse, and lack a common
benchmark for systematic evaluations. In this paper, we present the first
public benchmark for OSI, developed using the VoxCeleb dataset. We quantify the
effect of the watchlist size and speech duration on the watchlist-based speaker
detection task using three strong neural network based systems. In contrast to
the findings from prior research, we show that the commonly adopted adaptive
score normalization is not guaranteed to improve the performance for this task.
On the other hand, we show that score calibration and score fusion, two other
commonly used techniques in SV, result in significant improvements in OSI
performance.
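The watchlist detection task and the adaptive score normalization the paper evaluates can be illustrated with a short sketch. This is not the authors' implementation; it is a minimal NumPy version of standard adaptive S-norm (top-k cohort statistics) plus a max-score open-set decision rule, with hypothetical function names:

```python
import numpy as np

def adaptive_s_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Adaptive symmetric score normalization (AS-norm).

    raw_score: similarity between one enrolled speaker model and the test utterance.
    enroll_cohort_scores: scores of the enrolled model against an imposter cohort.
    test_cohort_scores: scores of the test utterance against the same cohort.
    Only the top-k highest cohort scores contribute to the statistics.
    """
    e = np.sort(np.asarray(enroll_cohort_scores))[-top_k:]
    t = np.sort(np.asarray(test_cohort_scores))[-top_k:]
    return 0.5 * ((raw_score - e.mean()) / e.std()
                  + (raw_score - t.mean()) / t.std())

def watchlist_detect(scores_per_speaker, threshold):
    """Open-set decision: return the best-matching in-set speaker index,
    or None if even the maximum score falls below the threshold (out-of-set)."""
    scores = np.asarray(scores_per_speaker)
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

Because the decision takes a maximum over all enrolled speakers, the out-of-set score distribution shifts upward as the watchlist grows, which is the false-alarm problem the benchmark quantifies.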
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Meta-Learning Framework for End-to-End Imposter Identification in Unseen Speaker Recognition [4.143603294943441]
We show the problem of generalization using fixed thresholds (computed using EER metric) for imposter identification in unseen speaker recognition.
We then introduce a robust speaker-specific thresholding technique for better performance.
We show the efficacy of the proposed techniques on VoxCeleb1, VCTK and the FFSVC 2022 datasets, beating the baselines by up to 10%.
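The fixed-threshold baseline that summary refers to is typically the operating point where false accepts and false rejects balance. A small sketch of computing such an EER threshold from held-out scores (illustrative only; names are hypothetical, and a brute-force search over candidate thresholds is used for clarity):

```python
import numpy as np

def eer_threshold(target_scores, imposter_scores):
    """Return the score threshold where the false-accept rate (FAR)
    and false-reject rate (FRR) are closest, i.e. the equal error rate point."""
    target_scores = np.asarray(target_scores, dtype=float)
    imposter_scores = np.asarray(imposter_scores, dtype=float)
    candidates = np.sort(np.concatenate([target_scores, imposter_scores]))
    best_t, best_gap = candidates[0], float("inf")
    for t in candidates:
        far = np.mean(imposter_scores >= t)  # imposters accepted at threshold t
        frr = np.mean(target_scores < t)     # targets rejected at threshold t
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t
```

A single threshold computed this way is applied to every enrolled speaker, which is the generalization weakness that speaker-specific thresholding aims to fix.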
arXiv Detail & Related papers (2023-06-01T17:49:58Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV)
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our codes will be made open-source for future works to do comparison.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.