Explainable Attribute-Based Speaker Verification
- URL: http://arxiv.org/abs/2405.19796v1
- Date: Thu, 30 May 2024 08:04:28 GMT
- Title: Explainable Attribute-Based Speaker Verification
- Authors: Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan,
- Abstract summary: We propose an attribute-based explainable speaker verification (SV) system.
It identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings.
We believe this approach better aligns with human reasoning, making it more understandable than traditional methods.
- Score: 12.941187430993796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
Related papers
- Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) has gained significant attention due to its several application fields, such as mental health, education, and human-computer interaction.
This study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance.
The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets.
arXiv Detail & Related papers (2024-06-01T00:39:55Z) - Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z) - Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can always be portrayed well by his/her few facial images or even a single image with shallow networks.
Fine-grained dynamic features associated with speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z) - Adversarial Disentanglement of Speaker Representation for
Attribute-Driven Privacy Preservation [17.344080729609026]
We introduce the concept of attribute-driven privacy preservation in speaker voice representation.
It allows a person to hide one or more personal aspects to a potential malicious interceptor and to the application provider.
We propose an adversarial autoencoding method that disentangles in the voice representation a given speaker attribute thus allowing its concealment.
arXiv Detail & Related papers (2020-12-08T14:47:23Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z) - Unsupervised Representation Disentanglement using Cross Domain Features
and Adversarial Learning in Variational Autoencoder based Voice Conversion [28.085498706505774]
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal.
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning.
arXiv Detail & Related papers (2020-01-22T02:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.