Robust Speaker Recognition with Transformers Using wav2vec 2.0
- URL: http://arxiv.org/abs/2203.15095v1
- Date: Mon, 28 Mar 2022 20:59:58 GMT
- Title: Robust Speaker Recognition with Transformers Using wav2vec 2.0
- Authors: Sergey Novoselov, Galina Lavrentyeva, Anastasia Avdeeva, Vladimir
Volokhov, Aleksei Gusev
- Abstract summary: This paper presents an investigation of using wav2vec 2.0 deep speech representations for the speaker recognition task.
It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently utilizes the power of unlabeled data.
- Score: 7.419725234099729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in unsupervised speech representation learning discover new
approaches and provide new state-of-the-art for diverse types of speech
processing tasks. This paper presents an investigation of using wav2vec 2.0
deep speech representations for the speaker recognition task. The proposed
fine-tuning procedure for wav2vec 2.0, with a simple TDNN and statistics
pooling back-end trained with an additive angular margin loss, yields a deep
speaker embedding extractor that generalizes well across different domains. It
is concluded that the Contrastive Predictive Coding pretraining scheme
efficiently exploits unlabeled data, and thus opens the door to powerful
transformer-based speaker recognition systems. The experimental results
obtained in this study demonstrate that fine-tuning can be done on relatively
small sets and on a clean version of the data. Using data augmentation during
fine-tuning provides additional performance gains in speaker verification. In
this study, speaker recognition systems were analyzed on a wide range of
well-known verification protocols: the VoxCeleb1 cleaned test set, the NIST
SRE 18 development set, the NIST SRE 2016 and NIST SRE 2019 evaluation sets,
the VOiCES evaluation set, and the NIST 2021 SRE and CTS challenge sets.
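The back-end described in the abstract (frame-level features, statistics pooling, then an embedding trained with additive angular margin loss) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, margin, and scale values, not the paper's actual implementation:

```python
import numpy as np

def statistics_pooling(frames):
    """Collapse a (T, D) matrix of frame-level features (e.g. wav2vec 2.0
    outputs) into one fixed-length 2*D vector: the per-dimension mean and
    standard deviation over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def aam_logits(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) logits: cosine similarities between the
    L2-normalised embedding and each class weight, with the margin added to
    the angle of the target class before scaling (margin/scale are
    illustrative values)."""
    e = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = w @ e
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = scale * cos
    logits[target] = scale * np.cos(theta[target] + margin)
    return logits
```

Statistics pooling doubles the feature dimension; the angular margin makes the target class deliberately harder to score, pushing speaker embeddings toward larger angular separation during training.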
Related papers
- Exploring Self-supervised Pre-trained ASR Models For Dysarthric and
Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, our best solution combines the RetinaFace face detector with a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) system.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- Fine-tuning wav2vec2 for speaker recognition [3.69563307866315]
We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding.
To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss.
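The two variants above can be loosely illustrated in NumPy. The pooling choice (plain mean pooling) and the affine score map with weights `w`, `b` are assumptions made for this sketch, not the cited paper's code:

```python
import numpy as np

def pool_embedding(frames):
    """Mean-pool a (T, D) wav2vec2 output sequence into a unit-length
    D-dimensional speaker embedding (one simple pooling choice)."""
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

def pair_logit(emb_a, emb_b, w=10.0, b=-5.0):
    """Utterance-pair variant: cosine similarity of two embeddings mapped
    through a learned affine transform (w, b are illustrative), to be
    trained against a same/different-speaker label."""
    return w * float(emb_a @ emb_b) + b

def bce_loss(logit, same_speaker):
    """Binary cross-entropy on the pair logit."""
    p = 1.0 / (1.0 + np.exp(-logit))
    y = 1.0 if same_speaker else 0.0
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

The single-utterance variant instead treats each training speaker as a class and applies CE or AAM softmax over the pooled embedding; only the pair variant needs the BCE head shown here.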
arXiv Detail & Related papers (2021-09-30T12:16:47Z)
- On Scaling Contrastive Representations for Low-Resource Speech Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset.
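The Equal Error Rate quoted in these results is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of how it can be estimated from raw verification scores (a simple threshold sweep; production toolkits interpolate the ROC more carefully):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping every observed score as a decision
    threshold and taking the point where the false-acceptance rate
    (impostor trials accepted) and false-rejection rate (genuine trials
    rejected) are closest; returns their average at that threshold."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target, nontarget])):
        far = np.mean(nontarget >= t)
        frr = np.mean(target < t)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

Perfectly separable scores give an EER of 0; fully overlapping score distributions push it toward 0.5 (50%).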
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.