Interpretable Temporal Class Activation Representation for Audio Spoofing Detection
- URL: http://arxiv.org/abs/2406.08825v2
- Date: Sun, 16 Jun 2024 20:01:29 GMT
- Title: Interpretable Temporal Class Activation Representation for Audio Spoofing Detection
- Authors: Menglu Li, Xiao-Ping Zhang,
- Abstract summary: We utilize the wav2vec 2.0 model and attentive utterance-level features to integrate interpretability directly into the model's architecture.
Our model achieves state-of-the-art results, with an EER of 0.51% and a min t-DCF of 0.0165 on the ASVspoof 2019-LA set.
- Score: 7.476305130252989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Explaining the decisions made by audio spoofing detection models is crucial for fostering trust in detection outcomes. However, current research on the interpretability of detection models is limited to applying XAI tools to post-trained models. In this paper, we utilize the wav2vec 2.0 model and attentive utterance-level features to integrate interpretability directly into the model's architecture, thereby enhancing transparency of the decision-making process. Specifically, we propose a class activation representation to localize the discriminative frames contributing to detection. Furthermore, we demonstrate that multi-label training based on spoofing types, rather than binary labels as bonafide and spoofed, enables the model to learn distinct characteristics of different attacks, significantly improving detection performance. Our model achieves state-of-the-art results, with an EER of 0.51% and a min t-DCF of 0.0165 on the ASVspoof2019-LA set.
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction & Attribution Sensitivity Analysis [2.5347892611213614]
Deep neural networks for classification are vulnerable to adversarial attacks, where small perturbations to input samples lead to incorrect predictions.
We develop a practical method for this characteristic of model prediction and feature attribution to detect adversarial samples.
Our approach demonstrates competitive performance even when an adversary is aware of the defense mechanism.
arXiv Detail & Related papers (2024-04-12T21:22:21Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, can not reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection
Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z) - Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs.
This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set nor labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z) - Raw waveform speaker verification for supervised and self-supervised
learning [30.08242210230669]
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification.
Under the best performing configuration, the model shows an equal error rate of 0.89%, competitive with state-of-the-art models.
We also explore the proposed model with a self-supervised learning framework and show the state-of-the-art performance in this line of research.
arXiv Detail & Related papers (2022-03-16T09:28:03Z) - Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z) - Visualizing Classifier Adjacency Relations: A Case Study in Speaker
Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z) - Novelty Detection Through Model-Based Characterization of Neural
Networks [19.191613437266184]
We propose a model-based characterization of neural networks to detect novel input types and conditions.
We validate our approach using four image recognition datasets including MNIST, Fashion-MNIST, CIFAR-10, and CURE-TSR.
arXiv Detail & Related papers (2020-08-13T20:03:25Z) - Self-Supervised Contrastive Learning for Unsupervised Phoneme
Segmentation [37.054709598792165]
The model is a convolutional neural network that operates directly on the raw waveform.
It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle.
At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries.
arXiv Detail & Related papers (2020-07-27T12:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.