The Phonexia VoxCeleb Speaker Recognition Challenge 2021 System Description
- URL: http://arxiv.org/abs/2109.02052v3
- Date: Wed, 8 Sep 2021 08:37:51 GMT
- Title: The Phonexia VoxCeleb Speaker Recognition Challenge 2021 System Description
- Authors: Josef Slavíček and Albert Swart and Michal Klčo and Niko Brümmer
- Abstract summary: We describe the Phonexia submission for the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) in the unsupervised speaker verification track.
An embedding extractor was bootstrapped using momentum contrastive learning, with input augmentations as the only source of supervision.
Score fusion was done by averaging the zt-normalized cosine scores of five different embedding extractors.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the Phonexia submission for the VoxCeleb Speaker Recognition
Challenge 2021 (VoxSRC-21) in the unsupervised speaker verification track. Our
solution was very similar to IDLab's winning submission for VoxSRC-20. An
embedding extractor was bootstrapped using momentum contrastive learning, with
input augmentations as the only source of supervision. This was followed by
several iterations of clustering to assign pseudo-speaker labels that were then
used for supervised embedding extractor training. Finally, score fusion was
done by averaging the zt-normalized cosine scores of five different embedding
extractors. We also briefly describe unsuccessful solutions involving i-vectors
instead of DNN embeddings and PLDA instead of cosine scoring.
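The fusion step described in the abstract (averaging zt-normalized cosine scores from several extractors) can be sketched as follows. This is a simplified illustration, assuming a generic cohort-based zt-norm; the helper names and cohort handling are assumptions, not the authors' exact implementation:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def znorm(score, enroll, cohort):
    # Normalize a trial score by the enrollment side's cohort score distribution
    cs = [cosine(enroll, c) for c in cohort]
    mu = sum(cs) / len(cs)
    sd = math.sqrt(sum((s - mu) ** 2 for s in cs) / len(cs))
    return (score - mu) / sd

def ztnorm(enroll, test, cohort):
    # z-norm the raw cosine score, then t-norm it against z-normed cohort scores
    s_z = znorm(cosine(enroll, test), enroll, cohort)
    zc = []
    for c in cohort:
        others = [x for x in cohort if x is not c]  # exclude self-comparison
        zc.append(znorm(cosine(c, test), c, others))
    mu = sum(zc) / len(zc)
    sd = math.sqrt(sum((s - mu) ** 2 for s in zc) / len(zc))
    return (s_z - mu) / sd

def fuse(extractors, utt_e, utt_t, cohort_utts):
    # Average zt-normalized cosine scores over several embedding extractors
    scores = []
    for extract in extractors:
        e, t = extract(utt_e), extract(utt_t)
        cohort = [extract(u) for u in cohort_utts]
        scores.append(ztnorm(e, t, cohort))
    return sum(scores) / len(scores)
```

In practice the cohort would be a few hundred embeddings drawn from the training data, and `extractors` would hold the five trained networks.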
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
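A relative EER reduction like the 14.6% quoted above is computed against a baseline system's EER; a minimal example with hypothetical numbers:

```python
def relative_reduction(baseline_eer, improved_eer):
    # Relative reduction in percent: how much of the baseline error was removed
    return 100.0 * (baseline_eer - improved_eer) / baseline_eer

# Hypothetical figures: a baseline EER of 10.0% improved to 8.54%
# corresponds to roughly a 14.6% relative reduction.
```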
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - The Newsbridge - Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description [0.0]
We describe the system used by our team for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC 2022) in the speaker diarization track.
Our solution was designed around a new combination of voice activity detection algorithms that uses the strengths of several systems.
arXiv Detail & Related papers (2023-01-17T15:52:39Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
The second problem is that diarisation inputs, unlike verification inputs, contain overlapped speech and speaker changes; we propose two data augmentation techniques to alleviate it, making embedding extractors aware of such input.
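The paper's exact augmentation techniques are not spelled out in this summary, but the two kinds of augmentation it names can be sketched generically: mixing in a second speaker to simulate overlap, and splicing two speakers to simulate a speaker change. The function names and the dB convention here are assumptions for illustration:

```python
def augment_overlap(wave, other, ratio_db=0.0, offset=0):
    # Mix a second speaker's waveform into `wave` to simulate overlapped speech.
    # ratio_db is the target-to-interferer level in dB (0 dB = equal amplitude).
    scale = 10 ** (-ratio_db / 20)
    out = list(wave)
    for i, s in enumerate(other):
        j = offset + i
        if 0 <= j < len(out):
            out[j] += scale * s
    return out

def augment_speaker_change(wave_a, wave_b, change_point):
    # Concatenate two speakers' audio to simulate an abrupt speaker change.
    return wave_a[:change_point] + wave_b[change_point:]
```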
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
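One common way to reduce overlapped diarization to a single-label prediction problem is power-set encoding of the active-speaker set, where each subset of simultaneously active speakers becomes one class. The sketch below illustrates that idea; it is not necessarily the SEND paper's exact scheme, and `build_powerset_labels` is a hypothetical helper name:

```python
from itertools import combinations

def build_powerset_labels(num_speakers, max_active=2):
    # Map each subset of active speakers (up to max_active at once)
    # to a single class index; class 0 is silence (empty set).
    classes = [()]
    for k in range(1, max_active + 1):
        classes.extend(combinations(range(num_speakers), k))
    return {subset: idx for idx, subset in enumerate(classes)}

labels = build_powerset_labels(3)
# A frame where speakers 0 and 2 overlap gets the single label labels[(0, 2)],
# so an ordinary softmax classifier can handle overlapped frames.
```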
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems based on deep neural networks used as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z) - The JHU submission to VoxSRC-21: Track 3 [31.804401484416452]
This report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed).
Our overall training process is similar to the one proposed for last year's VoxSRC 2020 challenge.
This is our best model submitted to the challenge, achieving EERs (%) of 1.89, 6.50, and 6.89 on the VoxCeleb1 test (O), VoxSRC-21 validation, and test trials, respectively.
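The EER figures quoted throughout this list are computed from target and non-target trial scores as the operating point where false-accept and false-reject rates meet; a minimal sketch (a coarse threshold sweep, not an interpolated ROC implementation):

```python
def eer(target_scores, nontarget_scores):
    # Equal error rate (%): sweep candidate thresholds and pick the one
    # where the false-accept rate is closest to the false-reject rate.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = (1.0, 0.0)
    for th in thresholds:
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < th for s in target_scores) / len(target_scores)
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return 100 * (best[0] + best[1]) / 2
```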
arXiv Detail & Related papers (2021-09-28T01:30:10Z) - Query Expansion System for the VoxCeleb Speaker Recognition Challenge 2020 [9.908371711364717]
We describe our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
One is to apply query expansion to speaker verification, which shows significant improvement over the baseline in our study.
Another is to combine the Probabilistic Linear Discriminant Analysis (PLDA) score with the ResNet score.
arXiv Detail & Related papers (2020-11-04T05:24:18Z) - The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge [6.6238321827660345]
This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020.
Our diarisation system uses a well-trained neural-network-based speech enhancement model as a pre-processing front-end for the input speech signals.
arXiv Detail & Related papers (2020-10-22T12:42:07Z) - Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification [22.894402178709136]
We propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared encoder with task-specific decoders.
We show that the proposed approach can leverage large, unlabeled, data-rich domains and learn speech patterns independent of downstream tasks.
arXiv Detail & Related papers (2020-08-08T22:47:10Z) - Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to i-vectors for single-speaker utterances and significantly lower WERs for utterances containing speaker changes.
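The attention-based memory read described above (weighting stored speaker i-vectors by their similarity to a query derived from the current utterance) can be illustrated as follows; `memory_read` and `softmax` are hypothetical helper names, not the paper's code:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def memory_read(query, memory):
    # Attention read: weight each stored speaker vector by its
    # dot-product similarity to the query, then average.
    weights = softmax([sum(q * m for q, m in zip(query, mem)) for mem in memory])
    dim = len(memory[0])
    return [sum(w * mem[d] for w, mem in zip(weights, memory)) for d in range(dim)]
```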
arXiv Detail & Related papers (2020-02-14T18:31:31Z) - Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
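An online distortion module of the kind described above can be sketched generically as additive noise plus randomly dropped samples; PASE+'s actual distortions (reverberation, band rejection, etc.) are richer, so treat this as an assumed simplification:

```python
import random

def distort(wave, noise_std=0.05, drop_prob=0.1, seed=0):
    # Randomly contaminate a waveform: Gaussian noise on every sample,
    # plus occasional zeroed (dropped) samples. Seeded for reproducibility.
    rng = random.Random(seed)
    out = []
    for s in wave:
        s = s + rng.gauss(0.0, noise_std)
        if rng.random() < drop_prob:
            s = 0.0
        out.append(s)
    return out
```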
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.