ShaneRun System Description to VoxCeleb Speaker Recognition Challenge 2020
- URL: http://arxiv.org/abs/2011.01518v1
- Date: Tue, 3 Nov 2020 07:26:21 GMT
- Title: ShaneRun System Description to VoxCeleb Speaker Recognition Challenge 2020
- Authors: Shen Chen
- Abstract summary: We describe the submission of ShaneRun's team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
We use ResNet-34 as the encoder to extract speaker embeddings, following the open-source voxceleb-trainer.
The final submitted system achieved a minDCF of 0.3098 and an EER of 5.076% on the fixed data track, outperforming the baseline by 1.3% (minDCF) and 2.2% (EER), respectively.
- Score: 3.0712335337791288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we describe the submission of ShaneRun's team to the
VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We use ResNet-34 as the
encoder to extract speaker embeddings, following the open-source
voxceleb-trainer. We also provide a simple method to implement optimum fusion
using the t-SNE-normalized distance of test utterance pairs instead of the
original negative Euclidean distance from the encoder (a rough sketch of this
scoring scheme follows the abstract). The final submitted system achieved a
minDCF of 0.3098 and an EER of 5.076% on the fixed data track, outperforming
the baseline by 1.3% (minDCF) and 2.2% (EER), respectively.
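To make the fusion scheme concrete, here is a minimal sketch of scoring trial pairs by their distance in a t-SNE projection of the test embeddings and fusing two systems by a weighted sum. It assumes embeddings are available as NumPy arrays; the 2-D projection size, min-max normalization, and equal fusion weights are illustrative assumptions, not details from the report.

```python
# Illustrative sketch (not the authors' code): score trial pairs with the
# t-SNE-normalized distance of test utterances, then fuse two systems.
import numpy as np
from sklearn.manifold import TSNE

def tsne_pair_scores(embeddings, trial_pairs, dim=2, seed=0):
    """Project embeddings with t-SNE and score each (i, j) pair with the
    negative Euclidean distance in the projected space, min-max normalized
    so that different systems can be fused on a common scale."""
    proj = TSNE(n_components=dim, random_state=seed).fit_transform(embeddings)
    dist = np.array([np.linalg.norm(proj[i] - proj[j]) for i, j in trial_pairs])
    scores = -dist
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def fuse(score_lists, weights):
    """Weighted-sum fusion of per-system normalized scores."""
    return sum(w * s for w, s in zip(weights, score_lists))

# Toy usage: two encoders' embeddings for 50 test utterances, one trial list.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 256))
emb_b = rng.normal(size=(50, 256))
trials = [(i, j) for i in range(10) for j in range(10, 20)]
fused = fuse([tsne_pair_scores(emb_a, trials),
              tsne_pair_scores(emb_b, trials)], weights=[0.5, 0.5])
```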
Related papers
- The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 [67.11294606070278]
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation (see the sketch after this entry).
arXiv Detail & Related papers (2024-01-07T14:20:52Z)
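As a rough illustration of the augmentations listed in this entry, the sketch below applies speed perturbation, horizontal flipping, rotation, and a brightness change to a lip-ROI clip tensor. It is a hypothetical PyTorch/torchvision pipeline, not the NPU-ASLP-LiAuto implementation; all parameter values are placeholders.

```python
# Illustrative video-clip augmentation for VSR training; a real pipeline
# would sample speed/angle/brightness randomly per clip.
import torch
import torchvision.transforms.functional as TF

def augment_clip(clip, speed=1.1, degrees=10.0, flip_p=0.5, brightness=1.2):
    # Speed perturbation: resample the frame indices by the speed factor.
    t = clip.shape[0]
    idx = torch.clamp((torch.arange(int(t / speed)) * speed).long(), max=t - 1)
    clip = clip[idx]
    # Apply the same spatial transform to every frame for temporal consistency.
    if torch.rand(1).item() < flip_p:
        clip = TF.hflip(clip)
    clip = TF.rotate(clip, angle=degrees)
    clip = TF.adjust_brightness(clip, brightness)
    return clip

clip = torch.rand(25, 3, 88, 88)   # 25 frames of an 88x88 RGB mouth ROI
aug = augment_clip(clip)
```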
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders (see the sketch after this entry).
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
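A minimal sketch of cross-attention fusion between the two modalities at a single encoder level follows; the multi-layer variant in the paper would insert such a block at several depths. Dimensions, head counts, and the residual/LayerNorm arrangement are assumptions for illustration, not taken from the paper's code.

```python
# Illustrative single-level cross-modal fusion: each stream attends to the
# other modality and keeps a residual path.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):
        a, _ = self.a2v(audio, video, video)   # audio queries visual features
        v, _ = self.v2a(video, audio, audio)   # video queries audio features
        return self.norm_a(audio + a), self.norm_v(video + v)

audio = torch.rand(2, 100, 256)   # (batch, audio frames, dim)
video = torch.rand(2, 25, 256)    # (batch, video frames, dim)
fused_a, fused_v = CrossModalFusion()(audio, video)
```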
- The GUA-Speech System Description for CNVSRC Challenge 2023 [8.5257557043542]
This study describes our system for Task 1 Single-speaker Visual Speech Recognition (VSR) fixed track in the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
We use intermediate connectionist temporal classification (Inter CTC) residual modules to relax the conditional independence assumption of CTC in our model (see the sketch after this entry).
We also use a bi-transformer decoder to enable the model to capture both past and future contextual information.
arXiv Detail & Related papers (2023-12-12T13:35:33Z)
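The Inter CTC residual idea can be sketched as follows: an auxiliary CTC head is attached to an intermediate encoder layer, and its posteriors are projected back into the hidden states, conditioning later layers on intermediate predictions (this is what relaxes the conditional independence assumption). Vocabulary size and dimensions below are illustrative, not from the GUA-Speech system.

```python
# Illustrative intermediate-CTC residual block for an encoder stack.
import torch
import torch.nn as nn

class InterCTCBlock(nn.Module):
    def __init__(self, dim=256, vocab=500):
        super().__init__()
        self.ctc_head = nn.Linear(dim, vocab)   # auxiliary CTC projection
        self.back = nn.Linear(vocab, dim)       # posteriors -> feature space

    def forward(self, hidden):
        logits = self.ctc_head(hidden)                    # for the aux CTC loss
        hidden = hidden + self.back(logits.softmax(-1))   # residual conditioning
        return hidden, logits

block = InterCTCBlock()
hidden = torch.rand(2, 100, 256)                # (batch, frames, dim)
hidden, inter_logits = block(hidden)
# Training would add ctc_loss(inter_logits, ...) weighted with the final CTC loss.
```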
- The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022 [4.022057598291766]
We describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture.
For track 3, we employ the joint training of source domain supervision and target domain self-supervision to get a speaker embedding extractor.
arXiv Detail & Related papers (2022-09-19T13:35:36Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a framework for joint source separation and dereverberation based on independent vector analysis (IVA).
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios (see the sketch after this entry).
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
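Conceptually, vocoder-based super-resolution maps the input to a full-band mel spectrogram and lets a vocoder synthesize the waveform, which is why arbitrary input rates can be handled. The sketch below shows only that pipeline shape, with librosa's Griffin-Lim mel inversion standing in for the neural vocoder; NVSR itself predicts the missing high-frequency mel bands with a network.

```python
# Pipeline-shape sketch only: waveform -> mel spectrogram -> vocoder.
# Griffin-Lim mel inversion is a crude stand-in for a neural vocoder.
import numpy as np
import librosa

def naive_sr(wav_lr, sr_lr=8000, sr_hr=16000, n_mels=80):
    wav_up = librosa.resample(wav_lr, orig_sr=sr_lr, target_sr=sr_hr)
    mel = librosa.feature.melspectrogram(y=wav_up, sr=sr_hr, n_mels=n_mels)
    # A learned model would fill the missing high bands of `mel` here.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr_hr)

wav = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000).astype(np.float32)
wav_hr = naive_sr(wav)   # 1 s of 8 kHz audio rendered at 16 kHz
```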
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Query Expansion System for the VoxCeleb Speaker Recognition Challenge 2020 [9.908371711364717]
We describe our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
Our submission explores two methods. One is to apply query expansion to speaker verification, which shows significant improvement over the baseline in the study.
The other is to combine the Probabilistic Linear Discriminant Analysis (PLDA) score with the ResNet score (a fusion sketch follows this entry).
arXiv Detail & Related papers (2020-11-04T05:24:18Z)
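The PLDA-plus-ResNet combination is a score-level fusion; a minimal sketch, assuming both back-ends score the same trial list, is below. The z-normalization and the weight alpha are illustrative choices, not the authors' recipe.

```python
# Illustrative score-level fusion of two verification back-ends.
import numpy as np

def znorm(s):
    """Z-normalize a score array so back-ends share a common scale."""
    return (s - s.mean()) / (s.std() + 1e-12)

def combine(plda_scores, resnet_scores, alpha=0.5):
    return alpha * znorm(plda_scores) + (1 - alpha) * znorm(resnet_scores)

rng = np.random.default_rng(0)
fused = combine(rng.normal(size=1000), rng.normal(size=1000), alpha=0.6)
```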
- The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020 [2.7920304852537536]
This report describes the systems submitted to the first and second tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
The best submitted systems achieve an EER of 3.808% and a minDCF of 0.1958 in the closed-condition track 1, and an EER of 3.798% and a minDCF of 0.1942 in the open-condition track 2, respectively (a sketch of how these metrics are computed follows this entry).
arXiv Detail & Related papers (2020-10-31T06:36:26Z)
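For reference, the EER and minDCF figures quoted throughout these systems can be computed from a trial list of scores and 0/1 target labels roughly as below. The cost parameters (P_target = 0.05, C_miss = C_fa = 1) mirror common VoxSRC settings and are an assumption here.

```python
# Sketch of EER and minDCF from verification scores and target labels.
import numpy as np

def eer_mindcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # Sweep the threshold over the sorted scores: targets at or below the
    # threshold are misses, non-targets above it are false alarms.
    fnr = np.cumsum(labels) / n_tar
    fpr = 1.0 - np.cumsum(1 - labels) / n_non
    eer = fnr[np.argmin(np.abs(fnr - fpr))]
    dcf = c_miss * p_target * fnr + c_fa * (1 - p_target) * fpr
    return eer, dcf.min()

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1, 1, 500), rng.normal(-1, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(eer_mindcf(scores, labels))
```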
- Word Error Rate Estimation Without ASR Output: e-WER2 [36.43741370454534]
We use a multistream end-to-end architecture to estimate the word error rate (WER) of speech recognition systems.
We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and systems without access to the ASR system (no-box).
Considering WER per sentence, our no-box system achieves a 0.56 Pearson correlation with the reference evaluation and a 0.24 root mean square error (RMSE) across 1,400 sentences (see the sketch after this entry).
arXiv Detail & Related papers (2020-08-08T00:19:09Z)
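The two agreement figures above (per-sentence Pearson correlation and RMSE against the reference WER) can be reproduced from paired arrays as in this small sketch; the data below is synthetic.

```python
# Sketch of the reported agreement metrics between estimated and true WER.
import numpy as np
from scipy.stats import pearsonr

def wer_agreement(est_wer, ref_wer):
    est, ref = np.asarray(est_wer), np.asarray(ref_wer)
    r, _ = pearsonr(est, ref)                       # per-sentence correlation
    rmse = float(np.sqrt(np.mean((est - ref) ** 2)))
    return r, rmse

rng = np.random.default_rng(0)
ref = rng.uniform(0, 1, 1400)                       # reference per-sentence WER
est = np.clip(ref + rng.normal(0, 0.2, 1400), 0, 1) # estimator output
print(wer_agreement(est, ref))
```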
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)