A Comparative Re-Assessment of Feature Extractors for Deep Speaker
Embeddings
- URL: http://arxiv.org/abs/2007.15283v1
- Date: Thu, 30 Jul 2020 07:55:58 GMT
- Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen
- Abstract summary: We provide extensive re-assessment of 14 feature extractors on VoxCeleb and SITW datasets.
Our findings reveal that features equipped with techniques such as spectral centroids, group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embeddings extraction.
- Score: 18.684888457998284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern automatic speaker verification relies largely on deep neural networks
(DNNs) trained on mel-frequency cepstral coefficient (MFCC) features. While
there are alternative feature extraction methods based on phase, prosody and
long-term temporal operations, they have not been extensively studied with
DNN-based methods. We aim to fill this gap by providing extensive re-assessment
of 14 feature extractors on VoxCeleb and SITW datasets. Our findings reveal
that features equipped with techniques such as spectral centroids, group delay
function, and integrated noise suppression provide promising alternatives to
MFCCs for deep speaker embeddings extraction. Experimental results demonstrate
up to 16.3% (VoxCeleb) and 25.1% (SITW) relative decrease in equal error rate
(EER) to the baseline.
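The headline numbers above are relative EER reductions. As a rough illustration of how that metric is usually computed, here is a minimal sketch in Python. The function names `equal_error_rate` and `relative_reduction`, the simple threshold sweep, and all scores and EER values are assumptions made for illustration; none of them come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative decrease of new_eer with respect to baseline_eer, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical trial scores (higher means "more likely the same speaker").
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)

# Hypothetical EERs in percent for a baseline and an alternative feature.
toy_gain = relative_reduction(5.0, 4.0)
```

Under this definition, a relative decrease of 16.3% means the alternative feature's EER is 83.7% of the baseline's EER, not that the EER fell by 16.3 percentage points.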
Related papers
- Frequency-Aware Deepfake Detection: Improving Generalizability through
Frequency Space Learning [81.98675881423131]
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images.
Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries.
We introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors.
arXiv Detail & Related papers (2024-03-12T01:28:00Z)
- Enhancing dysarthria speech feature representation with empirical mode
decomposition and Walsh-Hadamard transform [8.032273183441921]
We propose a feature enhancement for dysarthria speech called WHFEMD.
It combines empirical mode decomposition (EMD) and fast Walsh-Hadamard transform (FWHT) to enhance features.
arXiv Detail & Related papers (2023-12-30T13:25:26Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Multi-Frequency Information Enhanced Channel Attention Module for
Speaker Representation Learning [41.44950556040058]
We propose to utilize multi-frequency information and design two novel and effective attention modules.
The proposed attention modules can effectively capture more speaker information from multiple frequency components on the basis of DCT.
Experimental results demonstrate that our proposed SFSC and MFSC attention modules can efficiently generate more discriminative speaker representations.
arXiv Detail & Related papers (2022-07-10T21:19:36Z)
- Optimizing Multi-Taper Features for Deep Speaker Verification [21.237143465298505]
We propose to optimize the multi-taper estimator jointly with a deep neural network trained for ASV tasks.
With a maximum improvement on the SITW corpus of 25.8% in terms of equal error rate over the static-taper, our method helps preserve a balanced level of leakage and variance.
arXiv Detail & Related papers (2021-10-21T08:56:11Z)
- Optimized Power Normalized Cepstral Coefficients towards Robust Deep
Speaker Verification [21.237143465298505]
We revisit and optimize PNCCs by ablating its medium-time processor and by introducing channel energy normalization.
Experimental results with a DNN-based speaker verification system indicate substantial improvement over baseline PNCCs.
arXiv Detail & Related papers (2021-09-24T16:26:12Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- ADRN: Attention-based Deep Residual Network for Hyperspectral Image
Denoising [52.01041506447195]
We propose an attention-based deep residual network to learn a mapping from noisy HSI to the clean one.
Experimental results demonstrate that our proposed ADRN scheme outperforms the state-of-the-art methods both in quantitative and visual evaluations.
arXiv Detail & Related papers (2020-03-04T08:36:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.