Training speaker recognition systems with limited data
- URL: http://arxiv.org/abs/2203.14688v1
- Date: Mon, 28 Mar 2022 12:41:41 GMT
- Title: Training speaker recognition systems with limited data
- Authors: Nik Vaessen and David A. van Leeuwen
- Abstract summary: This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work considers training neural networks for speaker recognition with a
much smaller dataset size compared to contemporary work. We artificially
restrict the amount of data by proposing three subsets of the popular VoxCeleb2
dataset. These subsets are restricted to 50k audio files (versus over 1M
files available), and vary along the axes of number of speakers and session
variability. We train three speaker recognition systems on these subsets: the
X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the
self-supervised, pre-trained weights of wav2vec2 substantially improve
performance when training data is limited. Code and data subsets are available
at https://github.com/nikvaessen/w2v2-speaker-few-samples.
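The recipe the abstract describes can be pictured as: take the self-supervised wav2vec2 encoder, attach a small classification head, and fine-tune on the restricted subset. Below is a minimal sketch of that idea using the HuggingFace transformers API; the checkpoint name, the mean-pooling head, and the speaker count are illustrative assumptions, and the authors' actual training code is in the repository linked above.

```python
# Minimal sketch (not the authors' code): fine-tuning self-supervised,
# pre-trained wav2vec2 weights for speaker identification.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2V2SpeakerClassifier(nn.Module):
    def __init__(self, num_speakers: int):
        super().__init__()
        # Load the self-supervised, pre-trained wav2vec2 encoder.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Illustrative head: mean-pool frame features, then classify speakers.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_speakers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio.
        frames = self.encoder(waveform).last_hidden_state  # (B, T, H)
        return self.head(frames.mean(dim=1))               # (B, num_speakers)

# num_speakers is a placeholder; set it to the speaker count of the subset.
model = W2V2SpeakerClassifier(num_speakers=100)
logits = model(torch.randn(2, 16000))  # two dummy 1-second clips
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
```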
Related papers
- Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings [7.42741711946564]
We introduce the application of speech embeddings, extracted from deep learning models pre-trained on large audio datasets, to different tasks.
In comparison to standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines.
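For reference, UAR is simply recall averaged over classes without class-frequency weighting, which matters for imbalanced data such as stuttering labels. A quick sketch of computing it with scikit-learn (toy labels, not the paper's data):

```python
# Sketch: unweighted average recall (UAR), the metric quoted above.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 1, 1, 2]   # toy labels (3 classes, imbalanced)
y_pred = [0, 0, 1, 1, 1, 0]

# average="macro" is exactly the unweighted per-class mean of recalls.
uar = recall_score(y_true, y_pred, average="macro")
# Per-class recalls: class 0 -> 2/3, class 1 -> 2/2, class 2 -> 0/1
# UAR = (2/3 + 1 + 0) / 3 ≈ 0.556
print(f"UAR = {uar:.3f}")
```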
arXiv Detail & Related papers (2023-06-01T14:00:47Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
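Stripped of the audio-visual specifics, the core idea is classic pseudo-labelling. A hedged sketch follows, where `asr_model.transcribe` and the dataset structures are illustrative names rather than the paper's API:

```python
# Sketch of the pseudo-labelling idea behind Auto-AVSR: transcribe
# unlabelled clips with an existing ASR model and treat the (noisy)
# transcriptions as training targets. All names here are illustrative.
def build_augmented_training_set(labelled, unlabelled, asr_model):
    augmented = list(labelled)  # (clip, transcript) pairs with human labels
    for clip in unlabelled:
        transcript = asr_model.transcribe(clip)  # automatic, possibly noisy
        augmented.append((clip, transcript))
    return augmented  # larger training set -> lower WER per the paper
```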
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
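For intuition, the data2vec-style objective this builds on has a student network regress the contextualized targets of an exponential-moving-average (EMA) teacher. A rough sketch under the assumption that both networks expose per-layer hidden states (names, shapes, and the layer-averaging choice are illustrative, not AV-data2vec itself):

```python
# Rough sketch of a data2vec-style objective: a student predicts the
# contextualized targets of an EMA teacher on masked input.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track the student as an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec_loss(student, teacher, masked_input, clean_input, top_k=8):
    # Targets: average of the teacher's top-k hidden layers on the CLEAN
    # input (teacher is assumed to return a list of (B, T, H) tensors).
    with torch.no_grad():
        target = torch.stack(teacher(clean_input)[-top_k:]).mean(dim=0)
    # The student sees the MASKED input and regresses those targets.
    return F.mse_loss(student(masked_input), target)

# The teacher is typically initialized as a deep copy of the student.
```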
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection [7.42741711946564]
This work introduces the application of speech embeddings, extracted with deep models pre-trained on massive audio datasets, to different tasks.
In comparison to a standard stuttering detection system trained only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74% in overall accuracy over the baseline.
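As a concrete illustration of the embedding-extraction step, here is a sketch using SpeechBrain's public pre-trained ECAPA-TDNN speaker model (API as in speechbrain 0.5.x, where newer releases moved `EncoderClassifier` to `speechbrain.inference`; the downstream stuttering classifier is left open):

```python
# Sketch: extracting ECAPA-TDNN speaker embeddings with SpeechBrain's
# pre-trained VoxCeleb model, to feed a downstream stuttering classifier.
import torch
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)
wav = torch.randn(1, 16000)            # stand-in for a 1 s, 16 kHz clip
embedding = encoder.encode_batch(wav)  # (1, 1, 192) fixed-size embedding
# Train any lightweight classifier (e.g. an MLP) on these embeddings.
```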
arXiv Detail & Related papers (2022-04-04T15:12:25Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
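The speaker-order guidance is the paper's own contribution; the transducer objective underneath it is standard, however, and is available off the shelf, e.g. in torchaudio. A generic sketch of that loss, not the paper's multi-speaker setup:

```python
# Sketch: the RNN-T objective at the core of streaming ASR, via
# torchaudio's built-in transducer loss. The joint network's output has
# the standard shape (batch, time, target_len + 1, vocab).
import torch
import torchaudio

B, T, U, V = 2, 50, 10, 32                 # batch, frames, target len, vocab
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)  # 0 reserved: blank
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
loss.backward()
```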
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)