SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions
- URL: http://arxiv.org/abs/2509.17091v2
- Date: Mon, 29 Sep 2025 13:46:37 GMT
- Title: SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions
- Authors: Massa Baali, Sarthak Bisht, Francisco Teixeira, Kateryna Shapovalenko, Rita Singh, Bhiksha Raj
- Abstract summary: Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems. Existing benchmarks evaluate only subsets of the conditions these systems face, missing others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These challenges include a variety of natural and maliciously created conditions that cause signal degradation or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of them but also several entirely new, yet nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, genders, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
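SV benchmarks of this kind typically score each enrollment/test trial (commonly by cosine similarity of speaker embeddings) and report the equal error rate (EER), the operating point where the false-accept and false-reject rates balance. A minimal sketch of that metric follows; the function names and toy trial scores are illustrative assumptions, not code from the paper:

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """Equal error rate over a set of verification trials.

    `scores` are trial similarity scores; `labels` are 1 for target trials
    (same speaker) and 0 for impostor trials. Sweeps every observed score as
    a threshold and returns the point where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = 1.0, 0.5
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors wrongly accepted
        frr = np.mean(scores[labels == 1] < t)   # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```

On perfectly separable toy trials (all target scores above all impostor scores) this returns an EER of 0.0; degraded conditions such as noise or codec compression typically push target and impostor score distributions together and raise it.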
Related papers
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios, then categorize environmental noise into two primary types: user noise and tool noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
arXiv Detail & Related papers (2026-02-11T20:33:10Z)
- Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors.
arXiv Detail & Related papers (2026-01-12T05:43:51Z)
- Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data. The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z)
- Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations [40.12950482269347]
We present PromptSE, a framework that creates semantically equivalent prompt variants with emotion and personality templates. Our study shows that performance and stability behave as largely decoupled optimization objectives. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection.
arXiv Detail & Related papers (2025-09-17T04:17:42Z)
- TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis [74.31705485094096]
We introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail. We also construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes.
arXiv Detail & Related papers (2025-08-19T08:31:15Z)
- Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection [0.38233569758620056]
We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric. We develop a model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter.
arXiv Detail & Related papers (2025-07-21T06:37:27Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context. We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. We also develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Advancing Test-Time Adaptation in Wild Acoustic Test Settings [26.05732574338255]
Speech signals exhibit short-term consistency, requiring specialized adaptation strategies.
We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models.
Our approach outperforms existing baselines under various wild acoustic test settings.
arXiv Detail & Related papers (2023-10-14T06:22:08Z)
- Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments [0.7366405857677227]
Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or room impulse response database.
The present study introduces a generalization assessment framework that uses a reference model trained on the test condition.
The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), ConvTasNet, DCCRN and MANNER.
arXiv Detail & Related papers (2023-09-12T12:51:12Z)
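The generalization-gap framework above compares an evaluated system against a reference model trained directly on the matched test condition, so the gap between their scores isolates generalization loss from intrinsic task difficulty. A minimal sketch of that bookkeeping, where the per-condition dictionaries and the higher-is-better score convention are my own assumptions rather than the paper's code:

```python
def generalization_gap(system_scores, reference_scores):
    """Per-condition generalization gap.

    Both arguments map condition names to an enhancement score where higher
    is better (e.g. PESQ or STOI). A larger positive gap means the evaluated
    system falls further short of a reference model trained on that condition,
    i.e. worse generalization.
    """
    return {cond: reference_scores[cond] - system_scores[cond]
            for cond in system_scores}

# Hypothetical scores for two unseen test conditions
gaps = generalization_gap(
    {"unseen_noise": 2.1, "unseen_room": 1.9},
    {"unseen_noise": 2.6, "unseen_room": 2.8},
)
```

The appeal of this design choice is that a raw score drop on a hard condition can reflect the condition itself rather than poor generalization; subtracting the matched-reference score removes that confound.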
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.