Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
- URL: http://arxiv.org/abs/2204.02281v2
- Date: Tue, 13 Sep 2022 13:05:52 GMT
- Title: Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
- Authors: Wiebke Toussaint Hutiri, Lauriane Gorce, Aaron Yi Ding
- Abstract summary: Speaker verification (SV) provides billions of voice-enabled devices with access control, and ensures the security of voice-driven technologies.
Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified, aggregate users, and are not representative of real-life usage scenarios.
This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings.
- Score: 0.6015898117103067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker verification (SV) provides billions of voice-enabled devices with
access control, and ensures the security of voice-driven technologies. As a
type of biometrics, it is necessary that SV is unbiased, with consistent and
reliable performance across speakers irrespective of their demographic, social
and economic attributes. Current SV evaluation practices are insufficient for
evaluating bias: they are over-simplified, aggregate users, are not
representative of real-life usage scenarios, and do not account for the
consequences of errors. This paper proposes design guidelines for constructing
SV evaluation datasets that address these shortcomings. We propose a schema for
grading the difficulty of utterance pairs, and present an algorithm for
generating inclusive SV datasets. We empirically validate our proposed method
in a set of experiments on the VoxCeleb1 dataset. Our results confirm that the
number of utterance pairs per speaker and the difficulty grading of utterance pairs
have a significant effect on evaluation performance and variability. Our work
contributes to the development of SV evaluation practices that are inclusive
and fair.
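As an illustration of the kind of trial-list construction the abstract discusses, the sketch below builds a balanced list of same- and different-speaker utterance pairs with a configurable number of pairs per speaker. The function name, data layout, and pair-balancing scheme are illustrative assumptions, not the paper's algorithm (which additionally grades the difficulty of each pair).

```python
import random
from itertools import combinations

def make_trial_pairs(utterances_by_speaker, pairs_per_speaker, seed=0):
    """Build a balanced list of same-/different-speaker trial pairs.

    `utterances_by_speaker` maps a speaker ID to a list of utterance IDs.
    Returns (utt_a, utt_b, label) tuples, where label 1 means same speaker.
    """
    rng = random.Random(seed)
    speakers = sorted(utterances_by_speaker)
    trials = []
    for spk in speakers:
        utts = utterances_by_speaker[spk]
        # Same-speaker pairs: sample from all within-speaker combinations.
        same = list(combinations(utts, 2))
        rng.shuffle(same)
        for a, b in same[:pairs_per_speaker]:
            trials.append((a, b, 1))
        # Different-speaker pairs: pair one utterance of this speaker
        # with a random utterance of a randomly chosen other speaker.
        others = [s for s in speakers if s != spk]
        for _ in range(pairs_per_speaker):
            other = rng.choice(others)
            trials.append((rng.choice(utts),
                           rng.choice(utterances_by_speaker[other]), 0))
    return trials
```

Controlling `pairs_per_speaker` directly matters here because, as the abstract notes, the number of pairs per speaker has a significant effect on evaluation performance and variability.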
Related papers
- Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help? [0.0]
We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board.
arXiv Detail & Related papers (2024-07-15T13:47:55Z) - Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
arXiv Detail & Related papers (2024-05-23T14:48:15Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline [12.197453599489963]
We propose the development of an Error Explainable Benchmark (EEB) dataset.
This dataset, while considering both speech- and text-level, enables a granular understanding of the model's shortcomings.
Our proposition provides a structured pathway for a more 'real-world-centric' evaluation, allowing for the detection and rectification of nuanced system weaknesses.
arXiv Detail & Related papers (2024-01-26T03:42:45Z) - SVVAD: Personal Voice Activity Detection for Speaker Verification [24.57668015470307]
We propose a speaker verification-based voice activity detection (SVVAD) framework that adapts speech features according to which are most informative for speaker verification (SV).
Experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios.
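Equal error rate (EER), the metric cited above and standard in SV evaluation, is the operating point where the false-accept rate (FAR) equals the false-reject rate (FRR). It can be approximated by sweeping a decision threshold over the observed scores; this is a minimal sketch assuming higher scores mean "same speaker", not any specific paper's implementation.

```python
def compute_eer(scores, labels):
    """Approximate EER by sweeping thresholds at each observed score.

    `scores`: similarity scores (higher = more likely same speaker).
    `labels`: 1 for same-speaker trials, 0 for different-speaker trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_diff, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        # False accepts: different-speaker trials scored at/above threshold.
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t) / n_neg
        # False rejects: same-speaker trials scored below threshold.
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t) / n_pos
        diff = abs(far - frr)
        if diff < best_diff:
            best_diff, eer = diff, (far + frr) / 2
    return eer
```

A perfectly separating scorer yields an EER of 0; a scorer that cannot distinguish the classes at all approaches 0.5.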
arXiv Detail & Related papers (2023-05-31T05:59:33Z) - Towards single integrated spoofing-aware speaker verification embeddings [63.42889348690095]
This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings.
Our analysis shows that the inferior performance of single SASV embeddings stems from an insufficient amount of training data.
Experiments show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
arXiv Detail & Related papers (2023-05-30T14:15:39Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - SVEva Fair: A Framework for Evaluating Fairness in Speaker Verification [1.2437226707039446]
Speaker verification is a form of biometric identification that gives access to voice assistants.
Due to a lack of fairness metrics, little is known about how model performance varies across subgroups.
We develop SVEva Fair, an accessible, actionable and model-agnostic framework for evaluating the fairness of speaker verification components.
arXiv Detail & Related papers (2021-07-26T09:15:46Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.