PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction
- URL: http://arxiv.org/abs/2110.00940v1
- Date: Sun, 3 Oct 2021 07:05:29 GMT
- Title: PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction
- Authors: Yi Ma and Kong Aik Lee and Ville Hautamaki and Haizhou Li
- Abstract summary: Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
- Score: 90.55375210094995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement aims to improve the perceptual quality of the speech
signal by suppression of the background noise. However, excessive suppression
may lead to speech distortion and speaker information loss, which degrades the
performance of speaker embedding extraction. To alleviate this problem, we
propose an end-to-end deep learning framework, dubbed PL-EESR, for robust
speaker representation extraction. This framework is optimized based on the
feedback of the speaker identification task and the high-level perceptual
deviation between the raw speech signal and its noisy version. We conducted
speaker verification tasks in both noisy and clean environments to
evaluate our system. Compared to the baseline, our method shows better
performance in both clean and noisy environments, which means our method can
not only enhance the speaker-related information but also avoid adding
distortions.
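The abstract describes an objective driven by two signals: feedback from the speaker identification task and a high-level perceptual deviation between two versions of the speech signal. The following is a minimal sketch of how such a combined objective could be written, not the authors' implementation. It assumes the identification term is a softmax cross-entropy, the perceptual term is a mean squared distance between high-level feature vectors, and the two are combined as a weighted sum. The names `cross_entropy`, `perceptual_deviation`, `joint_loss`, and the weight `lam` are hypothetical and do not come from the paper.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one utterance (speaker-identification term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def perceptual_deviation(feat_a, feat_b):
    """Mean squared distance between two high-level feature vectors.

    Assumption: the features come from some fixed feature extractor;
    the paper's actual extractor and distance are not specified here.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def joint_loss(logits, label, feat_a, feat_b, lam=1.0):
    """Hypothetical combined objective: identification loss + lam * perceptual term."""
    return cross_entropy(logits, label) + lam * perceptual_deviation(feat_a, feat_b)


if __name__ == "__main__":
    # Toy values only; they are not results from the paper.
    logits = [2.0, 0.5, -1.0]      # scores over three hypothetical speakers
    feat_one = [0.2, 0.4, 0.1]     # high-level features of one signal
    feat_two = [0.3, 0.1, 0.1]     # high-level features of the other signal
    print(joint_loss(logits, 0, feat_one, feat_two, lam=0.5))
```

With `lam` set to zero the objective reduces to plain speaker identification training; larger values penalize feature-level deviation more heavily.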
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals speaker verification performance is somewhat unrelated to TS task performances, the one-hot vector outperforms enrollment-based ones, and the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Disentangled dimensionality reduction for noise-robust speaker diarisation [30.383712356205084]
Speaker embeddings play a crucial role in the performance of diarisation systems.
Speaker embeddings often capture spurious information such as noise and reverberation, adversely affecting performance.
We propose a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings.
We also propose the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise.
arXiv Detail & Related papers (2021-10-07T12:19:09Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.