DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in
Non-ideal Audio Signals
- URL: http://arxiv.org/abs/2008.11668v2
- Date: Mon, 13 Jun 2022 03:39:05 GMT
- Title: DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in
Non-ideal Audio Signals
- Authors: Anurag Chowdhury, Arun Ross
- Abstract summary: We propose a deep learning-based technique to deduce the filterbank design from vast amounts of speech audio.
The purpose of such a filterbank is to extract features robust to non-ideal audio conditions, such as degraded, short duration, and multi-lingual speech.
- Score: 19.053492887246826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speaker recognition algorithms typically use pre-defined
filterbanks, such as Mel-Frequency and Gammatone filterbanks, for
characterizing speech audio. However, it has been observed that the features
extracted using these filterbanks are not resilient to diverse audio
degradations. In this work, we propose a deep learning-based technique to
deduce the filterbank design from vast amounts of speech audio. The purpose of
such a filterbank is to extract features robust to non-ideal audio conditions,
such as degraded, short duration, and multi-lingual speech. To this effect, a
1D convolutional neural network is designed to learn a time-domain filterbank
called DeepVOX directly from raw speech audio. Secondly, an adaptive triplet
mining technique is developed to efficiently mine the data samples best suited
to train the filterbank. Thirdly, a detailed ablation study of the DeepVOX
filterbanks reveals the presence of both vocal source and vocal tract
characteristics in the extracted features. Experimental results on VOXCeleb2,
NIST SRE 2008, 2010 and 2018, and Fisher speech datasets demonstrate the
efficacy of the DeepVOX features across a variety of degraded, short duration,
and multi-lingual speech. The DeepVOX features are also shown to improve the
performance of existing speaker recognition algorithms, such as
xVector-PLDA and iVector-PLDA.
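
To make the front-end idea concrete, here is a minimal PyTorch sketch of a 1D convolution learning a time-domain filterbank directly from raw waveforms. It illustrates the general technique only: the class name, layer sizes, stride, and compression are assumptions, not the DeepVOX architecture from the paper.

```python
import torch
import torch.nn as nn


class LearnedFilterbank(nn.Module):
    """Sketch of a learned time-domain filterbank front-end.

    The convolution acts as a bank of learnable FIR filters applied to
    the raw waveform, standing in for a fixed Mel or Gammatone
    filterbank. All sizes are illustrative, not the DeepVOX design.
    """

    def __init__(self, num_filters=40, kernel_size=401, hop=160):
        super().__init__()
        self.filters = nn.Conv1d(1, num_filters, kernel_size,
                                 stride=hop, padding=kernel_size // 2)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples) raw speech
        x = torch.relu(self.filters(waveform))
        # log compression, mirroring the dynamic-range reduction of
        # conventional cepstral features
        return torch.log1p(x)  # (batch, num_filters, num_frames)


fb = LearnedFilterbank()
feats = fb(torch.randn(2, 1, 16000))  # two short raw-audio clips
print(feats.shape)                    # torch.Size([2, 40, 100])
```

Because the filter taps are ordinary convolution weights, they can be trained end-to-end with the speaker embedding network rather than fixed in advance.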
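The adaptive triplet mining step can likewise be pictured with a generic online-mining sketch. The function below picks, for each anchor-positive pair in a mini-batch, the hardest negative that still violates the triplet margin; the paper's adaptive criterion for selecting the samples best suited to train the filterbank is not reproduced here, and all names are hypothetical.

```python
import torch


def mine_margin_violating_triplets(emb, labels, margin=0.2):
    """Pick, per anchor-positive pair, the hardest margin-violating
    negative. A generic stand-in, not the paper's adaptive criterion."""
    dist = torch.cdist(emb, emb)                       # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-speaker mask
    triplets = []
    for a in range(len(labels)):
        for p in torch.where(same[a])[0]:
            if p.item() == a:
                continue
            # negatives closer than d(a, p) + margin still produce loss
            viol = (~same[a]) & (dist[a] < dist[a, p] + margin)
            if viol.any():
                idx = torch.where(viol)[0]
                n = idx[dist[a][viol].argmin()]
                triplets.append((a, p.item(), n.item()))
    return triplets


emb = torch.nn.functional.normalize(torch.randn(8, 64), dim=1)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(mine_margin_violating_triplets(emb, labels)[:3])
```

The mined (anchor, positive, negative) index triples would then feed a standard triplet margin loss over the corresponding embeddings.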
Related papers
- ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings [4.125756306660331]
Speaker Diarization (SD) aims at grouping speech segments that belong to the same speaker.
Beamforming, i.e., spatial filtering, is a common practice to process multi-microphone audio data.
This paper proposes a self-attention-based algorithm to select the output of a bank of fixed spatial filters.
arXiv Detail & Related papers (2024-06-05T13:28:28Z)
- DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement [10.662665274373387]
We present a real-time speech enhancement demo using DeepFilterNet.
Our model matches state-of-the-art speech enhancement benchmarks while achieving a real-time factor of 0.19 on a single-threaded notebook CPU.
arXiv Detail & Related papers (2023-05-14T19:09:35Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use the wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use light-DARTS, a modified version of differentiable architecture search (DARTS).
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering [9.200520879361916]
We propose DeepFilterNet, a two-stage speech enhancement framework utilizing deep filtering.
First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception.
The second stage employs deep filtering to enhance the periodic components of speech.
arXiv Detail & Related papers (2021-10-11T20:03:52Z)
- Speakerfilter-Pro: an improved target speaker extractor combines the time domain and frequency domain [28.830492233611196]
This paper introduces an improved target speaker extractor, referred to as Speakerfilter-Pro, based on our previous Speakerfilter model.
The Speakerfilter uses a bi-directional gated recurrent unit (BGRU) module to characterize the target speaker from anchor speech and a convolutional recurrent network (CRN) module to separate the target speech from a noisy signal.
The WaveUNet has been shown to perform speech separation better in the time domain.
arXiv Detail & Related papers (2020-10-25T07:30:30Z)
- Optimization of data-driven filterbank for automatic speaker verification [8.175789701289512]
We propose a new data-driven filter design method which optimizes filter parameters from given speech data.
The main advantage of the proposed method is that it requires a very limited amount of unlabeled speech data.
We show that the acoustic features created with the proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases.
arXiv Detail & Related papers (2020-07-21T11:42:20Z)
- SpEx: Multi-Scale Time Domain Speaker Extraction Network [89.00319878262005]
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
It is common to perform the extraction in the frequency domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
arXiv Detail & Related papers (2020-04-17T16:13:06Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)