PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation
Extraction
- URL: http://arxiv.org/abs/2110.00940v1
- Date: Sun, 3 Oct 2021 07:05:29 GMT
- Title: PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation
Extraction
- Authors: Yi Ma and Kong Aik Lee and Ville Hautamaki and Haizhou Li
- Abstract summary: Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
- Score: 90.55375210094995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement aims to improve the perceptual quality of the speech
signal by suppression of the background noise. However, excessive suppression
may lead to speech distortion and speaker information loss, which degrades the
performance of speaker embedding extraction. To alleviate this problem, we
propose an end-to-end deep learning framework, dubbed PL-EESR, for robust
speaker representation extraction. This framework is optimized based on the
feedback of the speaker identification task and the high-level perceptual
deviation between the raw speech signal and its noisy version. We conducted
speaker verification experiments in both noisy and clean environments to
evaluate our system. Compared to the baseline, our method shows better
performance in both clean and noisy environments, which means our method can
not only enhance the speaker-related information but also avoid adding
distortions.
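The abstract describes an objective driven by two signals: feedback from a speaker identification task and a high-level perceptual deviation between the raw speech signal and its noisy version. The sketch below shows one way such a combined objective could look; it is not the authors' implementation. The function names, the `weight` parameter, the use of cross-entropy for the identification term, and the mean squared distance over feature vectors are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, target):
    """Speaker-identification term: negative log-softmax of the true speaker.

    `logits` is a list of per-speaker scores; `target` is the true speaker index.
    (Assumed form; the abstract only mentions "feedback of the speaker
    identification task".)
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def perceptual_loss(feats_raw, feats_noisy):
    """Perceptual term: mean squared distance between high-level features.

    Each argument is a list of feature vectors (one per layer) that a
    hypothetical feature network would produce for the raw signal and for
    its noisy version.
    """
    total, count = 0.0, 0
    for layer_raw, layer_noisy in zip(feats_raw, feats_noisy):
        for a, b in zip(layer_raw, layer_noisy):
            total += (a - b) ** 2
            count += 1
    return total / count


def joint_loss(logits, target, feats_raw, feats_noisy, weight=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical trade-off."""
    return cross_entropy(logits, target) + weight * perceptual_loss(
        feats_raw, feats_noisy
    )


if __name__ == "__main__":
    # Toy values: two candidate speakers, one feature layer of size 2.
    loss = joint_loss([0.0, 0.0], 0, [[1.0, 2.0]], [[1.0, 4.0]], weight=0.5)
    print(loss)
```

In a real system both terms would be computed on network outputs and minimised by gradient descent; the plain-Python version here only makes the structure of the objective concrete.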
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Speech Emotion Recognition (SER) is subject to ubiquitous environmental noise.
We introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge.
We show that TRNet substantially increases the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Disentangled dimensionality reduction for noise-robust speaker diarisation [30.383712356205084]
Speaker embeddings play a crucial role in the performance of diarisation systems.
Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance.
We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings.
We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
arXiv Detail & Related papers (2021-10-07T12:19:09Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.