Disentangled dimensionality reduction for noise-robust speaker
diarisation
- URL: http://arxiv.org/abs/2110.03380v1
- Date: Thu, 7 Oct 2021 12:19:09 GMT
- Authors: You Jin Kim, Hee-Soo Heo, Jee-weon Jung, Youngki Kwon, Bong-Jin Lee,
Joon Son Chung
- Abstract summary: Speaker embeddings play a crucial role in the performance of diarisation systems.
Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance.
We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings.
We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
- Score: 30.383712356205084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is to train noise-robust speaker embeddings for
speaker diarisation. Speaker embeddings play a crucial role in the performance
of diarisation systems, but they often capture spurious information such as
noise and reverberation, adversely affecting performance. Our previous work
proposed an auto-encoder-based dimensionality reduction module to help remove
the spurious information. However, that module does not explicitly separate such
information and has also been found to be sensitive to hyperparameter values.
To this end, we propose two contributions to overcome these issues: (i) a novel
dimensionality reduction framework that can disentangle spurious information
from the speaker embeddings; (ii) the use of a speech/non-speech indicator to
prevent the speaker code from learning from the background noise. Through a
range of experiments conducted on four different datasets, our approach
consistently demonstrates the state-of-the-art performance among models that do
not adopt ensembles.
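To illustrate the idea, here is a minimal, hypothetical sketch of the kind of module the abstract describes: an auto-encoder-style dimensionality reduction that splits an input embedding into a disentangled speaker code and a spurious (noise/reverberation) code, with a speech/non-speech indicator masking the speaker code on non-speech frames. This is not the authors' implementation; the dimensions, the untrained linear layers, and all function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 256  # dimensionality of the input speaker embedding (assumed)
SPK_DIM = 32   # disentangled speaker code (assumed)
SPU_DIM = 32   # spurious noise/reverberation code (assumed)

# Randomly initialised linear maps stand in for trained encoder/decoder layers.
W_spk = rng.standard_normal((EMB_DIM, SPK_DIM)) * 0.05
W_spu = rng.standard_normal((EMB_DIM, SPU_DIM)) * 0.05
W_dec = rng.standard_normal((SPK_DIM + SPU_DIM, EMB_DIM)) * 0.05

def disentangle(embedding, is_speech):
    """Split an embedding into a speaker code and a spurious code.

    On non-speech frames, the speech/non-speech indicator masks the
    speaker code to zero so it cannot learn from background noise.
    """
    spk_code = embedding @ W_spk
    spu_code = embedding @ W_spu
    if not is_speech:
        spk_code = np.zeros_like(spk_code)
    return spk_code, spu_code

def reconstruct(spk_code, spu_code):
    """Auto-encoder style reconstruction from the concatenated codes."""
    return np.concatenate([spk_code, spu_code]) @ W_dec

emb = rng.standard_normal(EMB_DIM)
spk, spu = disentangle(emb, is_speech=True)
recon = reconstruct(spk, spu)
print(spk.shape, spu.shape, recon.shape)  # (32,) (32,) (256,)
```

In this sketch, only the low-dimensional speaker code would be passed to the downstream clustering stage of a diarisation pipeline, while the spurious code exists solely so the reconstruction objective has somewhere to deposit noise-related information.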
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces [13.046304017209872]
This paper introduces Robust Spin (R-Spin), a data-efficient, domain-specific self-supervision method for speaker- and noise-invariant speech representations.
R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces.
arXiv Detail & Related papers (2023-11-15T17:07:44Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, evaluation is not straightforward because the features required for good performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker-change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach with only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous self-supervised works.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition individually, the two modules are integrated into one framework by joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from context information in the time and frequency domains.
The results show that the proposed approach, using speech enhancement and multi-stage attention models, outperforms two strong baselines that do not use them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.