STC speaker recognition systems for the NIST SRE 2021
- URL: http://arxiv.org/abs/2111.02298v1
- Date: Wed, 3 Nov 2021 15:31:01 GMT
- Title: STC speaker recognition systems for the NIST SRE 2021
- Authors: Anastasia Avdeeva, Aleksei Gusev, Igor Korsunov, Alexander Kozlov,
Galina Lavrentyeva, Sergey Novoselov, Timur Pekhovsky, Andrey Shulipa, Alisa
Vinogradova, Vladimir Volokhov, Evgeny Smirnov, Vasily Galyuk
- Abstract summary: This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution using the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
- Score: 56.05258832139496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a description of STC Ltd. systems submitted to the NIST
2021 Speaker Recognition Evaluation for both fixed and open training
conditions. These systems consist of a number of diverse subsystems that use
deep neural networks as feature extractors. During the NIST 2021 SRE
challenge we focused on training state-of-the-art deep speaker embedding
extractors such as ResNets and ECAPA networks with additive angular margin
based loss functions. Additionally, inspired by the recent success of
wav2vec 2.0 features in automatic speech recognition, we explored the
effectiveness of this approach for the speaker verification field. According
to our observations, fine-tuning the pretrained large wav2vec 2.0 model
provided our best performing systems for the open track condition. Our
experiments with wav2vec 2.0 based extractors for the fixed condition showed
that unsupervised autoregressive pretraining with the Contrastive Predictive
Coding loss opens the door to training powerful transformer-based extractors
from raw speech signals. For the video modality, our best solution combines
the RetinaFace face detector with a deep ResNet face embedding extractor
trained on large face image datasets. The final results for the primary
systems were obtained by fusing different configurations of subsystems at the
score level, followed by score calibration.
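
The additive angular margin loss mentioned in the abstract penalizes the angle between an utterance embedding and its speaker's class weight vector. Below is a minimal PyTorch sketch in the spirit of ArcFace-style AAM-softmax; the margin and scale values are illustrative, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin softmax (ArcFace-style) for speaker classification."""

    def __init__(self, embedding_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```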
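As a rough illustration of how wav2vec 2.0 representations can serve speaker verification, the sketch below mean-pools frame-level features from a pretrained model into an utterance embedding and scores trials by cosine similarity. The checkpoint name is an assumption for the example, and the paper fine-tunes large pretrained models rather than using frozen features as done here.

```python
import torch
from transformers import Wav2Vec2Model

# Illustrative checkpoint; not necessarily the one used by the authors.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")
model.eval()

def utterance_embedding(waveform_16khz: torch.Tensor) -> torch.Tensor:
    # waveform_16khz: (1, num_samples) mono audio sampled at 16 kHz
    with torch.no_grad():
        frames = model(waveform_16khz).last_hidden_state  # (1, T, 1024)
    # Mean-pool over time and L2-normalize to get a fixed-size embedding.
    return torch.nn.functional.normalize(frames.mean(dim=1), dim=-1)

enroll = utterance_embedding(torch.randn(1, 16000))
test = utterance_embedding(torch.randn(1, 16000))
score = (enroll * test).sum()  # cosine similarity as the verification score
```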
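The Contrastive Predictive Coding pretraining referenced for the fixed condition optimizes an InfoNCE objective: a context vector summarizing past frames must identify the true future latent among negative samples. A minimal sketch follows, with tensor shapes and a linear predictor chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def cpc_infonce_loss(context, future_latents, negatives, predictor):
    # context: (B, D) autoregressive summary of past frames
    # future_latents: (B, D) true latent at the predicted future step
    # negatives: (B, N, D) distractor latents drawn from other positions
    # predictor: maps the context to a prediction of the future latent
    pred = predictor(context)                            # (B, D)
    pos = (pred * future_latents).sum(-1, keepdim=True)  # (B, 1)
    neg = torch.einsum("bd,bnd->bn", pred, negatives)    # (B, N)
    logits = torch.cat([pos, neg], dim=1)                # (B, 1+N)
    # The true future is always index 0; cross-entropy yields InfoNCE.
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

B, N, D = 8, 10, 256
loss = cpc_infonce_loss(torch.randn(B, D), torch.randn(B, D),
                        torch.randn(B, N, D), torch.nn.Linear(D, D))
```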
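Score-level fusion followed by calibration is commonly implemented with logistic regression trained on a held-out development set; the sketch below shows this generic recipe on synthetic data, not the authors' actual fusion configurations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = trials, columns = per-subsystem scores on a development set
# (e.g. ResNet, ECAPA, wav2vec 2.0 subsystems).
dev_scores = np.random.randn(1000, 3)
dev_labels = np.array([1, 0] * 500)  # 1 = target trial, 0 = non-target

# Linear fusion and calibration in one step: the learned weights fuse the
# subsystems, and the bias shifts scores toward calibrated log-odds.
fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)

eval_scores = np.random.randn(200, 3)
# decision_function returns the fused, calibrated log-odds score per trial.
fused_llr = fuser.decision_function(eval_scores)
```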
Related papers
- Stuttering Detection Using Speaker Representations and Self-supervised
Contextual Embeddings [7.42741711946564]
We introduce the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks.
In comparison to the standard SD systems trained only on the limited SEP-28k dataset, we obtain a relative improvement of 12.08%, 28.71%, 37.9% in terms of unweighted average recall (UAR) over the baselines.
arXiv Detail & Related papers (2023-06-01T14:00:47Z) - Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z) - The THUEE System Description for the IARPA OpenASR21 Challenge [12.458730613670316]
This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21).
We achieve outstanding results under both the Constrained and Constrained-plus training conditions.
We find that the feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture.
arXiv Detail & Related papers (2022-06-29T14:03:05Z) - Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection [7.42741711946564]
This work introduces the application of speech embeddings extracted with pre-trained deep models trained on massive audio datasets for different tasks.
In comparison to the standard stuttering detection system trained only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74% in terms of overall accuracy over baseline.
arXiv Detail & Related papers (2022-04-04T15:12:25Z) - Robust Speaker Recognition with Transformers Using wav2vec 2.0 [7.419725234099729]
This paper presents an investigation of using wav2vec 2.0 deep speech representations for the speaker recognition task.
It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently utilizes the power of unlabeled data.
arXiv Detail & Related papers (2022-03-28T20:59:58Z) - SVSNet: An End-to-end Speaker Voice Similarity Assessment Model [61.3813595968834]
We propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech.
The experimental results on the Voice Conversion Challenge 2018 and 2020 show that SVSNet notably outperforms well-known baseline systems.
arXiv Detail & Related papers (2021-07-20T10:19:46Z) - On Scaling Contrastive Representations for Low-Resource Speech
Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z) - A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights on the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)