Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
- URL: http://arxiv.org/abs/2204.02281v2
- Date: Tue, 13 Sep 2022 13:05:52 GMT
- Title: Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
- Authors: Wiebke Toussaint Hutiri, Lauriane Gorce, Aaron Yi Ding
- Abstract summary: Speaker verification (SV) provides billions of voice-enabled devices with access control, and ensures the security of voice-driven technologies.
Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified, aggregate users, and are not representative of real-life usage scenarios.
This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings.
- Score: 0.6015898117103067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker verification (SV) provides billions of voice-enabled devices with
access control, and ensures the security of voice-driven technologies. As a
type of biometrics, it is necessary that SV is unbiased, with consistent and
reliable performance across speakers irrespective of their demographic, social
and economic attributes. Current SV evaluation practices are insufficient for
evaluating bias: they are over-simplified, aggregate users, are not
representative of real-life usage scenarios, and do not account for the
consequences of errors. This paper proposes design guidelines for constructing
SV evaluation datasets that address these shortcomings. We propose a schema for
grading the difficulty of utterance pairs, and present an algorithm for
generating inclusive SV datasets. We empirically validate our proposed method
in a set of experiments on the VoxCeleb1 dataset. Our results confirm that the
number of utterance pairs per speaker and the difficulty grading of utterance pairs
have a significant effect on evaluation performance and variability. Our work
contributes to the development of SV evaluation practices that are inclusive
and fair.
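As an illustration of the kind of trial-list construction the abstract discusses, the sketch below builds a balanced list of same- and different-speaker utterance pairs with a configurable number of pairs per speaker. The function name, data layout, and pair-balancing scheme are illustrative assumptions, not the paper's algorithm (which additionally grades the difficulty of each pair).

```python
import random
from itertools import combinations

def make_trial_pairs(utterances_by_speaker, pairs_per_speaker, seed=0):
    """Build a balanced list of same-/different-speaker trial pairs.

    `utterances_by_speaker` maps a speaker ID to a list of utterance IDs.
    Returns (utt_a, utt_b, label) tuples, where label 1 means same speaker.
    """
    rng = random.Random(seed)
    speakers = sorted(utterances_by_speaker)
    trials = []
    for spk in speakers:
        utts = utterances_by_speaker[spk]
        # Same-speaker pairs: sample from all within-speaker combinations.
        same = list(combinations(utts, 2))
        rng.shuffle(same)
        for a, b in same[:pairs_per_speaker]:
            trials.append((a, b, 1))
        # Different-speaker pairs: pair one utterance of this speaker
        # with a random utterance of a randomly chosen other speaker.
        others = [s for s in speakers if s != spk]
        for _ in range(pairs_per_speaker):
            other = rng.choice(others)
            trials.append((rng.choice(utts),
                           rng.choice(utterances_by_speaker[other]), 0))
    return trials
```

Controlling `pairs_per_speaker` directly matters here because, as the abstract notes, the number of pairs per speaker has a significant effect on evaluation performance and variability.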
Related papers
- Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help? [0.0]
We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board.
arXiv Detail & Related papers (2024-07-15T13:47:55Z) - Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
arXiv Detail & Related papers (2024-05-23T14:48:15Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline [12.197453599489963]
We propose the development of an Error Explainable Benchmark (EEB) dataset.
This dataset, while considering both speech- and text-level, enables a granular understanding of the model's shortcomings.
Our proposition provides a structured pathway for a more 'real-world-centric' evaluation, allowing for the detection and rectification of nuanced system weaknesses.
arXiv Detail & Related papers (2024-01-26T03:42:45Z) - SVVAD: Personal Voice Activity Detection for Speaker Verification [24.57668015470307]
We propose a speaker verification-based voice activity detection (SVVAD) framework that adapts speech features according to which are most informative for speaker verification (SV).
Experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios.
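Equal error rate (EER), the metric cited above and standard in SV evaluation, is the operating point where the false-accept rate (FAR) equals the false-reject rate (FRR). It can be approximated by sweeping a decision threshold over the observed scores; this is a minimal sketch assuming higher scores mean "same speaker", not any specific paper's implementation.

```python
def compute_eer(scores, labels):
    """Approximate EER by sweeping thresholds at each observed score.

    `scores`: similarity scores (higher = more likely same speaker).
    `labels`: 1 for same-speaker trials, 0 for different-speaker trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_diff, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        # False accepts: different-speaker trials scored at/above threshold.
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t) / n_neg
        # False rejects: same-speaker trials scored below threshold.
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t) / n_pos
        diff = abs(far - frr)
        if diff < best_diff:
            best_diff, eer = diff, (far + frr) / 2
    return eer
```

A perfectly separating scorer yields an EER of 0; a scorer that cannot distinguish the classes at all approaches 0.5.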
arXiv Detail & Related papers (2023-05-31T05:59:33Z) - Towards single integrated spoofing-aware speaker verification embeddings [63.42889348690095]
This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings.
Our analysis shows that the inferior performance of single SASV embeddings stems from an insufficient amount of training data.
Experiments show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
arXiv Detail & Related papers (2023-05-30T14:15:39Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - SVEva Fair: A Framework for Evaluating Fairness in Speaker Verification [1.2437226707039446]
Speaker verification is a form of biometric identification that gives access to voice assistants.
Due to a lack of fairness metrics, little is known about how model performance varies across subgroups.
We develop SVEva Fair, an accessible, actionable and model-agnostic framework for evaluating the fairness of speaker verification components.
arXiv Detail & Related papers (2021-07-26T09:15:46Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.