Related papers: Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings

URL: http://arxiv.org/abs/2508.14115v1
Date: Mon, 18 Aug 2025 11:32:13 GMT
Title: Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings
Authors: Taous Iatariene, Alexandre Guérin, Romain Serizel
Abstract summary: We propose a Knowledge Distillation based training approach for short-context speaker embedding extraction. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. Results demonstrate that our models are effective at short-context embedding extraction and more robust to overlap.
Score: 52.985061676464554
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e., by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which forces identity reassignment to operate over the long term in order to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn negatively impacts identity reassignment. To address this, we propose a Knowledge Distillation (KD) based training approach for short-context speaker embedding extraction from two-speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity reassignment, as a step towards a low-latency speaker-embedding-based tracking system. Results demonstrate that our distilled models are effective at short-context embedding extraction and more robust to overlap. However, blockwise reassignment results indicate that further work is needed to handle simultaneous speech more effectively.
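The blockwise identity reassignment idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the exhaustive search over permutations, and the names `reassign_block` and `blockwise_reassignment` are all assumptions made for illustration. The sketch assumes that each block yields one embedding per track and that one reference embedding per known identity is available.

```python
from itertools import permutations

import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def reassign_block(track_embeddings, reference_embeddings):
    """Map each track in one block to a distinct identity (hypothetical helper).

    Assumes the number of tracks does not exceed the number of identities.
    Exhaustive search is only practical for a handful of speakers.
    """
    sim = cosine_similarity_matrix(track_embeddings, reference_embeddings)
    n_tracks, n_ids = sim.shape
    best = max(
        permutations(range(n_ids), n_tracks),
        key=lambda perm: sum(sim[i, perm[i]] for i in range(n_tracks)),
    )
    return list(best)


def blockwise_reassignment(blocks, reference_embeddings):
    """Run the per-block assignment independently on every fixed-size block."""
    return [reassign_block(block, reference_embeddings) for block in blocks]


# Toy data: two identities and two blocks; the tracks swap speakers in block 2.
references = np.array([[1.0, 0.0], [0.0, 1.0]])
blocks = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),
    np.array([[0.1, 0.9], [0.8, 0.3]]),
]
print(blockwise_reassignment(blocks, references))  # prints [[0, 1], [1, 0]]
```

Because every block is decided on its own, latency is bounded by the block length, but each decision depends on embeddings extracted from a short context, which is the regime the distilled extractors are meant to handle.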

Related papers

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling [4.875137823752148]
This work introduces a new approach to train simultaneous speech separation and diarization using automatic identification of target speaker embeddings. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features. We present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames.
arXiv Detail & Related papers (2025-08-08T15:24:10Z)
Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization [4.1088673993841685]
This paper investigates the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. We introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments.
arXiv Detail & Related papers (2025-07-25T15:05:20Z)
Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis [20.80178325643714]
In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns.
arXiv Detail & Related papers (2025-07-02T22:16:42Z)
Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers [53.12031345322412]
We propose to perform identity reassignment post-tracking, using speaker embeddings. Beamforming is used to enhance the signal towards the speakers' positions in order to compute speaker embeddings. We evaluate the performance of the proposed speaker embedding-based identity reassignment method on a dataset where speakers change position during inactivity periods.
arXiv Detail & Related papers (2025-06-23T13:02:20Z)
Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels. Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics. SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures. We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals. Two main challenges are the complex acoustic environment and the real-time processing requirement. We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.