DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification
- URL: http://arxiv.org/abs/2601.13999v1
- Date: Tue, 20 Jan 2026 14:20:44 GMT
- Title: DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification
- Authors: Youngmoon Jung, Joon-Young Yang, Ju-ho Kim, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho
- Abstract summary: Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning.
- Score: 24.474179536226362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
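The nested hierarchy described in the abstract can be pictured with a minimal sketch. It assumes the usual Matryoshka convention that each sub-embedding is a prefix of the full vector, and it assumes, purely for illustration, that the prefix length is chosen from the test utterance duration at scoring time. The abstract does not state how DAME maps durations to dimensions or how it scores trials, so the schedule `DURATION_TO_DIM`, the full size `FULL_DIM`, and the helper names below are hypothetical.

```python
import numpy as np

# Hypothetical schedule: (maximum duration in seconds, prefix length).
# Illustrative values only; they are not taken from the paper.
DURATION_TO_DIM = [(1.0, 64), (2.0, 128), (5.0, 192)]
FULL_DIM = 256  # assumed size of the full speaker embedding


def dim_for_duration(duration_s: float) -> int:
    """Pick the nested prefix length for an utterance of this duration."""
    for max_duration, dim in DURATION_TO_DIM:
        if duration_s <= max_duration:
            return dim
    return FULL_DIM


def sub_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Return the L2-normalised first `dim` coordinates of an embedding."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)


def cosine_score(enroll: np.ndarray, test: np.ndarray, test_duration_s: float) -> float:
    """Cosine similarity computed on the prefix matched to the test duration."""
    dim = dim_for_duration(test_duration_s)
    return float(np.dot(sub_embedding(enroll, dim), sub_embedding(test, dim)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enroll = rng.standard_normal(FULL_DIM)  # stand-in for an encoder output
    test = rng.standard_normal(FULL_DIM)
    for duration in (0.8, 1.5, 4.0, 12.0):
        print(duration, dim_for_duration(duration), cosine_score(enroll, test, duration))
```

Because every sub-embedding is a slice of the same vector, the encoder runs once per utterance regardless of which prefix is scored, which is one way a nested design can avoid extra inference cost.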
Related papers
- TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding [15.908533215017059]
We present TagSpeech, a unified framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that acts as a synchronization signal between semantic understanding and speaker tracking.
arXiv Detail & Related papers (2026-01-11T12:40:07Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model [65.93900011975238]
DELULU is a speaker-aware self-supervised foundational model for verification, diarization, and profiling applications. It is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
arXiv Detail & Related papers (2025-10-20T15:35:55Z)
- Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings [52.985061676464554]
We propose a Knowledge Distillation based training approach for short context speaker embedding extraction. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. Results demonstrate that our models are effective at short-context embedding extraction and more robust to overlap.
arXiv Detail & Related papers (2025-08-18T11:32:13Z)
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs [56.92627816895305]
Video large language models have achieved remarkable performance in tasks such as video question answering. Our dataset focuses on enhancing temporal comprehension across five key dimensions. We introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets.
arXiv Detail & Related papers (2025-03-13T03:05:11Z)
- Universal speaker recognition encoders for different speech segments duration [7.104489204959814]
A system trained simultaneously on pooled short and long speech segments does not give optimal verification results.
We describe our simple recipe for training a universal speaker encoder for any type of selected neural network architecture.
arXiv Detail & Related papers (2022-10-28T16:06:00Z)
- Segment Aggregation for short utterances speaker verification using raw waveforms [47.41124427552161]
We propose a method that compensates for the performance degradation of speaker verification for short utterances.
The proposed method adopts an ensemble-based design to improve the stability and accuracy of speaker verification systems.
arXiv Detail & Related papers (2020-05-07T08:57:22Z)
- Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs [65.28795726837386]
We introduce a meta-learning framework for imbalance length pairs.
We train it with a support set of long utterances and a query set of short utterances of varying lengths.
By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models.
arXiv Detail & Related papers (2020-04-06T17:53:14Z)