Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation
- URL: http://arxiv.org/abs/2509.14632v1
- Date: Thu, 18 Sep 2025 05:21:20 GMT
- Title: Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation
- Authors: Miseul Kim, Soo Jin Park, Kyungguen Byun, Hyeon-Kyeong Shin, Sunkuk Moon, Shuhua Zhang, Erik Visser
- Abstract summary: We propose a style-controllable speech generation model to augment speech across diverse styles. The proposed system starts with diarized segments from a conventional diarizer. Speaker embeddings from both the original and generated audio are blended to enhance the system's robustness.
- Score: 6.289152035711056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when one raises their voice or speaks faster during a conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Then, speaker embeddings from both the original and generated audio are blended to enhance the system's robustness in grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35%, respectively.
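The embedding-blending step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the mixing weight `alpha`, and the mean-then-blend rule are hypothetical placeholders, since the abstract does not specify the exact blending scheme or the embedding extractor.

```python
import numpy as np

def blend_speaker_embeddings(orig_embs, aug_embs, alpha=0.5):
    """Blend speaker embeddings from an original diarized segment with
    embeddings extracted from style-augmented copies of that segment.

    orig_embs, aug_embs: arrays of shape (n, d) -- n embeddings of
    dimension d. alpha is an assumed convex mixing weight; the paper
    does not state the actual blending rule.
    """
    orig_mean = np.asarray(orig_embs).mean(axis=0)
    aug_mean = np.asarray(aug_embs).mean(axis=0)
    blended = alpha * orig_mean + (1.0 - alpha) * aug_mean
    # L2-normalize, as is typical before cosine-similarity clustering
    return blended / np.linalg.norm(blended)

# Toy usage: two 4-dim embeddings per source
orig = np.array([[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]])
aug = np.array([[0.6, 0.4, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]])
e = blend_speaker_embeddings(orig, aug)
print(e)  # blended unit-norm embedding
```

The blended, unit-norm vectors would then feed the clustering stage of the diarization pipeline in place of the original embeddings alone.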
Related papers
- Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis [20.80178325643714]
In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns.
arXiv Detail & Related papers (2025-07-02T22:16:42Z)
- Improving speaker verification robustness with synthetic emotional utterances [14.63248006004598]
A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker. Previous models exhibit high error rates when dealing with emotional utterances compared to neutral ones. This issue primarily stems from the limited availability of labeled emotional speech data. We propose a novel approach employing the CycleGAN framework to serve as a data augmentation method.
arXiv Detail & Related papers (2024-11-30T02:18:26Z)
- We Need Variations in Speech Generation: Sub-center Modelling for Speaker Embeddings [47.2515056854372]
We propose a novel speaker embedding network that employs multiple sub-centers per speaker class during training. This sub-center modeling allows the embedding to capture a broader range of speaker-specific variations while maintaining speaker classification performance.
arXiv Detail & Related papers (2024-07-05T06:54:24Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Cross-speaker style transfer for text-to-speech using data augmentation [11.686745250628247]
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
arXiv Detail & Related papers (2022-02-10T15:10:56Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.