Interpolating Speaker Identities in Embedding Space for Data Expansion
- URL: http://arxiv.org/abs/2508.19210v1
- Date: Tue, 26 Aug 2025 17:15:42 GMT
- Title: Interpolating Speaker Identities in Embedding Space for Data Expansion
- Authors: Tianchi Liu, Ruijie Tao, Qiongqiong Wang, Yidi Jiang, Hardik B. Sailor, Ke Zhang, Jingru Lin, Haizhou Li
- Abstract summary: INSIDE (Interpolating Speaker Identities in Embedding Space) is a novel data expansion method that synthesizes new speaker identities by interpolating between existing speaker embeddings. Models trained with INSIDE-expanded data outperform those trained only on real data, achieving 3.06% to 5.24% relative improvements.
- Score: 38.856864258602165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning-based speaker verification systems is largely attributed to access to large-scale and diverse speaker identity data. However, collecting data from more identities is expensive, challenging, and often limited by privacy concerns. To address this limitation, we propose INSIDE (Interpolating Speaker Identities in Embedding Space), a novel data expansion method that synthesizes new speaker identities by interpolating between existing speaker embeddings. Specifically, we select pairs of nearby speaker embeddings from a pretrained speaker embedding space and compute intermediate embeddings using spherical linear interpolation. These interpolated embeddings are then fed to a text-to-speech system to generate corresponding speech waveforms. The resulting data is combined with the original dataset to train downstream models. Experiments show that models trained with INSIDE-expanded data outperform those trained only on real data, achieving 3.06% to 5.24% relative improvements. While INSIDE is primarily designed for speaker verification, we also validate its effectiveness on gender classification, where it yields a 13.44% relative improvement. Moreover, INSIDE is compatible with other augmentation techniques and can serve as a flexible, scalable addition to existing training pipelines.
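The interpolation step described in the abstract can be sketched in a few lines. Below is a minimal NumPy illustration of spherical linear interpolation (slerp) between two speaker embeddings; the function name, the embedding dimensionality, and the choice of interpolation factor `t` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def slerp(e1: np.ndarray, e2: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two speaker embeddings.

    Both embeddings are first projected onto the unit sphere; the result
    also lies on the unit sphere, which matches the common practice of
    length-normalizing speaker embeddings.
    """
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    # Angle between the two unit vectors; clip guards against
    # floating-point values slightly outside [-1, 1].
    omega = np.arccos(np.clip(np.dot(e1, e2), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return e1  # embeddings are (nearly) identical
    return (np.sin((1.0 - t) * omega) * e1 + np.sin(t * omega) * e2) / np.sin(omega)

# Hypothetical usage: a midpoint (t = 0.5) between two nearby speakers
# would be passed to a TTS system to synthesize a new speaker identity.
emb_a = np.random.default_rng(0).normal(size=256)
emb_b = np.random.default_rng(1).normal(size=256)
new_identity = slerp(emb_a, emb_b, 0.5)
```

Unlike plain linear interpolation, slerp keeps the intermediate embedding on the unit hypersphere, so the synthesized identity stays in the same normalized space as the real embeddings.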
Related papers
- Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers [53.12031345322412]
We propose to perform identity reassignment post-tracking, using speaker embeddings. Beamforming is used to enhance the signal towards the speakers' positions in order to compute speaker embeddings. We evaluate the performance of the proposed speaker embedding-based identity reassignment method on a dataset where speakers change position during inactivity periods.
arXiv Detail & Related papers (2025-06-23T13:02:20Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [0.0]
We explore self-supervised learning for speaker verification by learning representations directly from raw audio.
Our approach is based on recent information learning frameworks and an intensive data pre-processing step.
arXiv Detail & Related papers (2022-07-12T13:01:55Z)
- Training speaker recognition systems with limited data [2.3148470932285665]
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
arXiv Detail & Related papers (2022-03-28T12:41:41Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z)