Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances
- URL: http://arxiv.org/abs/2002.06033v1
- Date: Fri, 14 Feb 2020 13:34:33 GMT
- Title: Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances
- Authors: Aleksei Gusev, Vladimir Volokhov, Tseren Andzhukaev, Sergey Novoselov,
Galina Lavrentyeva, Marina Volkova, Alice Gazizullina, Andrey Shulipa, Artem
Gorlanov, Anastasia Avdeeva, Artem Ivanov, Alexander Kozlov, Timur Pekhovsky,
Yuri Matveev
- Abstract summary: Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
- Score: 53.063441357826484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker recognition systems based on deep speaker embeddings have achieved
significant performance in controlled conditions according to the results
obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the
practical point of view, taking into account the increased interest in virtual
assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker
verification on short utterances in uncontrolled noisy environment conditions
is one of the most challenging and highly demanded tasks. This paper presents
approaches aimed at achieving two goals: a) improve the quality of far-field
speaker verification systems in the presence of environmental noise and
reverberation, and b) reduce the system quality degradation for short utterances.
For these purposes, we considered deep neural network architectures based on
TDNN (Time Delay Neural Network) and ResNet (Residual Neural Network) blocks. We
experimented with state-of-the-art embedding extractors and their training
procedures. Obtained results confirm that ResNet architectures outperform the
standard x-vector approach in terms of speaker verification quality for both
long-duration and short-duration utterances. We also investigate the impact of
the speech activity detector, different scoring models, and adaptation and score
normalization techniques. The experimental results are presented for publicly
available data and verification protocols for the VoxCeleb1, VoxCeleb2, and
VOiCES datasets.
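Since the abstract highlights scoring models and score normalization, a minimal sketch of adaptive symmetric score normalization (s-norm) over an imposter cohort may help; the cosine backend, embedding dimension, and top-k size below are illustrative assumptions, not values from the paper.
```python
import numpy as np

def cosine_scores(x, cohort):
    """Cosine similarity between one embedding x and each cohort row."""
    x = x / np.linalg.norm(x)
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    return c @ x

def adaptive_s_norm(enroll, test, cohort, top_k=200):
    """Adaptive symmetric score normalization of a single trial score."""
    raw = float(np.dot(enroll, test) /
                (np.linalg.norm(enroll) * np.linalg.norm(test)))
    # Score each side against the cohort and keep the top-k closest imposters.
    se = np.sort(cosine_scores(enroll, cohort))[-top_k:]
    st = np.sort(cosine_scores(test, cohort))[-top_k:]
    # Normalize the raw score by each side's cohort statistics and average.
    z = (raw - se.mean()) / se.std()
    t = (raw - st.mean()) / st.std()
    return 0.5 * (z + t)

# Toy usage with random embeddings (dimension 256 is illustrative).
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=256), rng.normal(size=256)
cohort = rng.normal(size=(1000, 256))
print(adaptive_s_norm(enroll, test, cohort))
```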
Related papers
- Probing the Information Encoded in Neural-based Acoustic Models of
Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine what information is encoded, and where, in the acoustic models (AMs) of automatic speech recognition systems.
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
arXiv Detail & Related papers (2024-02-29T18:43:53Z)
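A minimal sketch of the probing idea in the paper above, assuming utterance-level AM activations have already been extracted; the 512-dim layer, random stand-in data, and logistic-regression probe are all hypothetical choices for illustration.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for utterance-level AM activations (e.g. mean-pooled hidden
# states of one layer) and a property to probe, here binary gender labels.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 512))   # 2000 utterances, 512-dim layer
labels = rng.integers(0, 2, size=2000)       # hypothetical gender labels

x_tr, x_te, y_tr, y_te = train_test_split(activations, labels,
                                          test_size=0.2, random_state=0)
# A linear probe: if a simple classifier recovers the property from the
# activations, the layer encodes it; chance-level accuracy suggests it does not.
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print("probe accuracy:", probe.score(x_te, y_te))
```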
- Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features [2.724035499453558]
We present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling supra-segmental temporal (SST) features.
We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced.
arXiv Detail & Related papers (2023-11-01T12:45:31Z)
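One way to picture the test in the paper above is to compare embeddings before and after destroying temporal order; the mean-pooling stand-in extractor and frame-level shuffling below are assumptions, not the authors' exact protocol.
```python
import numpy as np

def shuffle_frames(features, rng):
    """Destroy supra-segmental temporal structure by permuting the
    time axis of a (frames, dim) feature matrix."""
    return features[rng.permutation(len(features))]

def embed(features):
    # Stand-in extractor: mean pooling over time, which by construction
    # ignores frame order. A real speaker network replaces this function.
    return features.mean(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 40))          # 300 frames of 40-dim features
e_orig = embed(feats)
e_shuf = embed(shuffle_frames(feats, rng))
cos = np.dot(e_orig, e_shuf) / (np.linalg.norm(e_orig) * np.linalg.norm(e_shuf))
# A similarity near 1.0 means the extractor is insensitive to temporal
# order, i.e. it does not exploit SST features.
print("cosine after shuffling:", cos)
```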
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
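The paper above fuses audio and visual streams inside an Efficient Conformer; the toy PyTorch module below only shows the generic pattern of projecting both modalities into a shared space before a joint encoder, with all dimensions and the GRU encoder as placeholders.
```python
import torch
import torch.nn as nn

class ToyAVFusion(nn.Module):
    """Project per-frame audio and visual features into a shared space,
    concatenate them along the feature axis, and encode them jointly.
    Assumes both streams are already aligned to the same frame rate."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Placeholder joint encoder; the paper uses an Efficient Conformer.
        self.encoder = nn.GRU(2 * hidden, hidden, batch_first=True)

    def forward(self, audio, visual):
        # audio: (batch, frames, audio_dim); visual: (batch, frames, visual_dim)
        fused = torch.cat([self.audio_proj(audio),
                           self.visual_proj(visual)], dim=-1)
        out, _ = self.encoder(fused)
        return out

model = ToyAVFusion()
a = torch.randn(2, 100, 80)    # e.g. log-mel frames
v = torch.randn(2, 100, 512)   # e.g. lip-region CNN features
print(model(a, v).shape)       # torch.Size([2, 100, 256])
```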
- SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection [0.4511923587827302]
SpecRNet is a neural network architecture characterized by a quick inference time and low computational requirements.
Our benchmark shows that SpecRNet, requiring up to about 40% less time to process an audio sample, provides performance comparable to the LCNN architecture.
arXiv Detail & Related papers (2022-10-12T11:36:14Z)
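A generic timing harness in the spirit of the benchmark above; the two stand-in detectors are not SpecRNet or LCNN, and the warm-up/repeat counts are common practice rather than the paper's setup.
```python
import time
import torch
import torch.nn as nn

def mean_inference_time(model, sample, warmup=5, repeats=50):
    """Average wall-clock forward-pass time on one audio sample."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):            # warm-up runs excluded from timing
            model(sample)
        start = time.perf_counter()
        for _ in range(repeats):
            model(sample)
    return (time.perf_counter() - start) / repeats

# Two stand-in detectors of different widths; not SpecRNet or LCNN.
small = nn.Sequential(nn.Conv1d(1, 16, 9), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
large = nn.Sequential(nn.Conv1d(1, 128, 9), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
sample = torch.randn(1, 1, 64000)          # ~4 s of 16 kHz audio
print("small:", mean_inference_time(small, sample))
print("large:", mean_inference_time(large, sample))
```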
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
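A sketch of the first step described above, pulling high-level representations from a pre-trained wav2vec 2.0 model via torchaudio; the WAV2VEC2_BASE checkpoint and the single linear head are assumptions, and the paper's light-DARTS classifier is not reproduced.
```python
import torch
import torchaudio

# Pre-trained wav2vec 2.0 encoder from torchaudio (downloads weights on
# first use); the paper uses a wav2vec model, exact checkpoint unspecified.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, int(bundle.sample_rate * 3))  # 3 s placeholder audio
with torch.no_grad():
    features, _ = model.extract_features(waveform)      # one tensor per layer
rep = features[-1].mean(dim=1)                          # utterance-level vector

# Hypothetical lightweight head standing in for the light-DARTS classifier.
head = torch.nn.Linear(rep.shape[-1], 2)                # real vs. fake logits
print(head(rep).shape)                                  # torch.Size([1, 2])
```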
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
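A common way to realize the fine-tuning scheme above is to freeze the lower blocks of a pre-trained CNN and update only the high-level layers; the ResNet-18 backbone and split point below are assumptions, not the paper's exact network.
```python
from torchvision.models import resnet18

# Stand-in for a CNN speaker network pre-trained on VoxCeleb; here we use
# torchvision's ResNet-18 purely to illustrate the freezing pattern.
model = resnet18(num_classes=512)

# Freeze everything, then unfreeze only the high-level layers, which would
# be fine-tuned on clean in-domain speech (CRSS-Forensics in the paper).
for p in model.parameters():
    p.requires_grad = False
for p in list(model.layer4.parameters()) + list(model.fc.parameters()):
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable[:4])  # only layer4.* and fc.* remain trainable
```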
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
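A minimal DARTS-style mixed operation over TDNN context widths, in the spirit of the search described above; the candidate kernel sizes and the softmax relaxation follow the generic DARTS recipe, and LF-MMI training is omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedTDNNOp(nn.Module):
    """DARTS-style relaxation: a softmax-weighted sum of candidate TDNN
    layers (1-D convolutions with different temporal context widths)."""
    def __init__(self, dim=64, contexts=(1, 3, 5)):
        super().__init__()
        self.ops = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in contexts)
        # Architecture parameters, learned jointly with the weights.
        self.alpha = nn.Parameter(torch.zeros(len(contexts)))

    def forward(self, x):                  # x: (batch, dim, frames)
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

op = MixedTDNNOp()
x = torch.randn(2, 64, 100)
print(op(x).shape)                         # torch.Size([2, 64, 100])
# After the search, the context with the largest alpha would be kept.
```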
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
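The derivation step described above can be pictured as repeating one searched cell; the fixed residual conv block below merely stands in for whatever operation combination AutoSpeech's search would select.
```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Stand-in for a searched neural cell; in AutoSpeech its internal
    operations are chosen by NAS, here it is a fixed conv block."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU())

    def forward(self, x):
        return x + self.body(x)            # residual keeps stacking stable

def derive_model(num_cells=8, channels=32):
    """Derive the final CNN by stacking the searched cell multiple times."""
    stem = nn.Conv2d(1, channels, 3, padding=1)
    cells = [Cell(channels) for _ in range(num_cells)]
    return nn.Sequential(stem, *cells, nn.AdaptiveAvgPool2d(1), nn.Flatten())

model = derive_model()
spec = torch.randn(2, 1, 64, 100)          # batch of spectrograms
print(model(spec).shape)                   # torch.Size([2, 32])
```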
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition individually, the two modules are integrated into one framework by joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines without these modules in most acoustic conditions.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
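A toy two-stage attention over frequency and then time, loosely following the summary above; the pooling-based attention weights and all dimensions are illustrative, not the paper's architecture.
```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """Two-stage attention: reweight a (batch, 1, freq, time) spectrogram
    first along frequency, then along time, to emphasize speaker cues."""
    def __init__(self, freq_bins=64):
        super().__init__()
        self.freq_att = nn.Sequential(nn.Linear(freq_bins, freq_bins),
                                      nn.Sigmoid())
        self.time_att = nn.Sequential(nn.Conv1d(freq_bins, 1, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, 1, freq, time)
        s = x.squeeze(1)                   # (batch, freq, time)
        # Stage 1: frequency attention from the time-averaged spectrum.
        fw = self.freq_att(s.mean(dim=2))          # (batch, freq)
        s = s * fw.unsqueeze(-1)
        # Stage 2: time attention from the reweighted features.
        tw = self.time_att(s)                      # (batch, 1, time)
        return (s * tw).unsqueeze(1)

att = TimeFreqAttention()
x = torch.randn(2, 1, 64, 200)
print(att(x).shape)                        # torch.Size([2, 1, 64, 200])
```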