Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
- URL: http://arxiv.org/abs/2507.02176v1
- Date: Wed, 02 Jul 2025 22:16:42 GMT
- Title: Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
- Authors: Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi
- Abstract summary: In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns.
- Score: 20.80178325643714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
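To make the practice under scrutiny concrete, the sketch below scores speaker similarity as the cosine between ASV embeddings, the standard pipeline the paper critiques. It assumes SpeechBrain's pretrained ECAPA-TDNN encoder (speechbrain/spkrec-ecapa-voxceleb) and mono input audio; it is illustrative only, not the paper's evaluation code or the proposed U3D metric.

```python
# Minimal sketch of ASV-embedding speaker similarity (illustrative;
# not the paper's code). Assumes SpeechBrain's pretrained ECAPA-TDNN
# encoder and mono input audio; file paths are placeholders.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

def asv_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between ASV embeddings of two utterances."""
    embs = []
    for path in (path_a, path_b):
        wav, sr = torchaudio.load(path)
        if sr != 16000:  # the encoder expects 16 kHz input
            wav = torchaudio.functional.resample(wav, sr, 16000)
        embs.append(encoder.encode_batch(wav).squeeze())
    return torch.nn.functional.cosine_similarity(
        embs[0], embs[1], dim=0
    ).item()

# Higher scores suggest the same speaker; as the paper argues, such
# scores mostly reflect static traits (timbre, pitch range) and can
# miss dynamic cues such as rhythm.
similarity = asv_similarity("reference.wav", "clone.wav")
```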
Related papers
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of a Large Audio Model (LAM) as a judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences.
arXiv Detail & Related papers (2025-07-17T00:39:18Z) - Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers [53.12031345322412]
We propose to perform identity reassignment post-tracking, using speaker embeddings. Beamforming is used to enhance the signal towards the speakers' positions in order to compute speaker embeddings. We evaluate the performance of the proposed speaker embedding-based identity reassignment method on a dataset where speakers change position during inactivity periods.
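As a hedged illustration (not the paper's pipeline, and with the beamforming-based enhancement step omitted), reassigning track identities from speaker embeddings can be framed as a similarity-maximizing assignment problem:

```python
# Sketch of embedding-based identity reassignment: match tracked
# segments to enrolled speakers by cosine similarity. Embeddings are
# assumed precomputed; assumes no more tracks than enrolled speakers.
import numpy as np
from scipy.optimize import linear_sum_assignment

def reassign_identities(track_embs: np.ndarray,
                        speaker_embs: np.ndarray) -> np.ndarray:
    """track_embs: (n_tracks, d); speaker_embs: (n_speakers, d).
    Returns, for each track, the index of the assigned speaker."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    s = speaker_embs / np.linalg.norm(speaker_embs, axis=1, keepdims=True)
    similarity = t @ s.T                       # cosine similarity matrix
    rows, cols = linear_sum_assignment(-similarity)  # maximize similarity
    assignment = np.empty(len(track_embs), dtype=int)
    assignment[rows] = cols
    return assignment
```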
arXiv Detail & Related papers (2025-06-23T13:02:20Z) - CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition [49.27067541740956]
We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
arXiv Detail & Related papers (2025-06-06T13:25:56Z) - VANPY: Voice Analysis Framework [0.0]
We develop the VANPY framework for automated pre-processing, feature extraction, and classification of voice data. Four of the framework's components were developed in-house and integrated to extend speaker characterization capabilities. We demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction".
arXiv Detail & Related papers (2025-02-17T21:12:57Z) - Improving speaker verification robustness with synthetic emotional utterances [14.63248006004598]
A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker. Previous models exhibit high error rates when dealing with emotional utterances compared to neutral ones. This issue primarily stems from the limited availability of labeled emotional speech data. We propose a novel approach employing the CycleGAN framework to serve as a data augmentation method.
arXiv Detail & Related papers (2024-11-30T02:18:26Z) - Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization [0.5497663232622965]
This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE).
The network is designed to disentangle speech components in order to specifically target and modify the speaker identity while preserving the linguistic and emotional content.
Findings indicate that this method outperforms most baseline techniques in preserving emotional information.
arXiv Detail & Related papers (2024-09-24T08:55:10Z) - Evaluating Speaker Identity Coding in Self-supervised Models and Humans [0.42303492200814446]
Speaker identity plays a significant role in human communication and is increasingly used in societal applications.
We show that self-supervised representations from different families are significantly better for speaker identification than acoustic representations.
We also show that such a speaker identification task can be used to better understand the nature of acoustic information representation in different layers of these powerful networks.
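As a hedged sketch of this kind of layer-wise analysis (not the paper's exact protocol), one can train a linear probe on frozen features from each layer and compare speaker-identification accuracy; feature arrays are assumed precomputed:

```python
# Illustrative linear probe for speaker ID on frozen representations
# (assumed precomputed, e.g. mean-pooled per utterance and per layer).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(feats: np.ndarray, speaker_ids: np.ndarray) -> float:
    """feats: (n_utterances, d) features from one layer."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, speaker_ids, test_size=0.2, random_state=0,
        stratify=speaker_ids)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # higher = more speaker information

# Comparing probe_layer across layers indicates where speaker
# identity is most linearly decodable.
```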
arXiv Detail & Related papers (2024-06-14T20:07:21Z) - Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
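For readers interpreting such numbers, here is a minimal, generic EER computation from verification trial scores (illustrative only, not tied to this paper's code); minDCF additionally weights miss and false-alarm costs under a target-speaker prior:

```python
# Generic equal error rate (EER) from verification scores; labels are
# 1 for same-speaker trials and 0 for different-speaker trials.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)
```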
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z) - Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Data-driven Detection and Analysis of the Patterns of Creaky Voice [13.829936505895692]
Creaky voice is a quality frequently used as a phrase-boundary marker.
The automatic detection and modelling of creaky voice may have implications for speech technology applications.
arXiv Detail & Related papers (2020-05-31T13:34:30Z)