Related papers: DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

URL: http://arxiv.org/abs/2510.17662v1
Date: Mon, 20 Oct 2025 15:35:55 GMT
Title: DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model
Authors: Massa Baali, Rita Singh, Bhiksha Raj,
Abstract summary: DELULU is a speaker-aware self-supervised foundational model for verification, diarization, and profiling applications.<n>It is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization.<n>Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
Score: 65.93900011975238
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.

Related papers

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling [4.875137823752148]
This work introduces a new approach to train simultaneous speech separation and diarization using automatic identification of target speaker embeddings.<n>Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features.<n>We present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames.
arXiv Detail & Related papers (2025-08-08T15:24:10Z)
TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models [43.761503775097104]
We introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB)<n>This benchmark includes four widely recognized target-speaker processing tasks.<n>Speaker embedding extracted from enrollment speech is used as a clue to condition downstream models.
arXiv Detail & Related papers (2025-05-10T14:23:37Z)
Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks? For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector. Our analysis reveals speaker verification performance is somewhat unrelated to TS task performances, the one-hot vector outperforms enrollment-based ones, and the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
Speaker Verification in Agent-Generated Conversations [47.6291644653831]
The recent success of large language models (LLMs) has attracted widespread interest to develop role-playing conversational agents personalized to the characteristics and styles of different speakers to enhance their abilities to perform both general and special purpose dialogue tasks. This study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aimed to verify whether two sets of utterances originate from the same speaker.
arXiv Detail & Related papers (2024-05-16T14:46:18Z)
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective. Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? [86.53044183309824]
We study which factor leads to the success of self-supervised learning on speaker-related tasks. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to SV task is from a combination of mask speech prediction loss, data scale, and model size.
arXiv Detail & Related papers (2022-04-27T08:35:57Z)
Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification [15.652180150706002]
We propose self-supervised speaker representation learning strategies. In the front-end, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker.
arXiv Detail & Related papers (2021-12-16T14:55:44Z)
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction. Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance. We scale up training dataset to 94 thousand hours public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning. We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features. We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.