Related papers: Robust Speaker Recognition Using Speech Enhancement And Attention Model

URL: http://arxiv.org/abs/2001.05031v2
Date: Fri, 22 May 2020 09:16:56 GMT
Title: Robust Speaker Recognition Using Speech Enhancement And Attention Model
Authors: Yanpei Shi, Qiang Huang, Thomas Hain
Abstract summary: Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by joint optimisation using deep neural networks. To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains. The results show that the proposed approach, using speech enhancement and multi-stage attention models, outperforms two strong baselines that do not use them in most acoustic conditions in the experiments.
Score: 37.33388614967888
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by joint optimisation using deep neural networks. Furthermore, to increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from context information in the time and frequency domains. To evaluate the speaker identification and verification performance of the proposed approach, we test it on the VoxCeleb1 dataset, one of the most widely used benchmark datasets. Moreover, the robustness of the proposed approach is also tested on VoxCeleb1 data corrupted by three types of interference (general noise, music, and babble) at different signal-to-noise ratio (SNR) levels. The results show that the proposed approach, using speech enhancement and multi-stage attention models, outperforms two strong baselines that do not use them in most acoustic conditions in our experiments.
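The abstract describes testing on speech corrupted by noise at different SNR levels. As a rough illustration of what mixing at a target SNR involves, here is a minimal Python sketch. It is not the authors' data preparation procedure: the function names (`mix_at_snr`, `measured_snr_db`), the synthetic sine-wave "speech", the Gaussian noise, and the SNR values are all assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it.

    Hypothetical helper for illustration; returns (mixture, scaled_noise).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, scaled_noise):
    """SNR in dB actually realised by the scaled noise."""
    return 10 * math.log10(power(speech) / power(scaled_noise))


# Synthetic stand-ins: a 220 Hz tone for "speech", Gaussian samples for "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr_db in (0, 5, 10, 20):  # illustrative levels, not taken from the paper
    mixture, scaled = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, scaled), 6))
```

The scaling is applied to the noise rather than the speech so that the clean signal keeps its original level across all conditions; real evaluation pipelines may make a different choice.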

Related papers

Visual-Informed Speech Enhancement Using Attention-Based Beamforming [13.084978776817222]
We propose a novel Visual-Informed Neural Beamforming Network (VI-NBFNet). The proposed network integrates microphone array signal processing and deep neural networks (DNNs) using multimodal input features. It is intended to handle both static and moving speakers by introducing a supervised end-to-end beamforming framework equipped with an attention mechanism.
arXiv Detail & Related papers (2026-03-05T15:19:41Z)
Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture [0.0]
The proposed method incorporates a dedicated noise identification module that operates concurrently with speech transcription. Experimental validation using publicly available speech and environmental audio datasets demonstrates substantial improvements in transcription quality and noise discrimination.
arXiv Detail & Related papers (2025-12-02T18:54:45Z)
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
Multi-Stage Speaker Diarization for Noisy Classrooms [1.4549461207028445]
This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions.
arXiv Detail & Related papers (2025-05-16T05:35:06Z)
End-to-end multi-channel speaker extraction and binaural speech synthesis [26.373624846079686]
Speech clarity and spatial audio immersion are the two most critical factors in enhancing remote conferencing experiences. We introduce an end-to-end deep learning framework capable of mapping multi-channel noisy and reverberant signals directly to clean and spatialized speech. In this framework, a novel magnitude-weighted interaural level difference loss function is proposed that aims to improve the accuracy of spatial rendering.
arXiv Detail & Related papers (2024-10-08T06:55:35Z)
Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification. We show that the joint model performs comparably to Whisper under clean conditions. Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
arXiv Detail & Related papers (2024-10-07T18:39:59Z)
A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model [14.795953417531907]
We propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Spell, Attend (LAS) speech recognition system. The proposed method achieves a 19.26% improvement compared with a strong baseline.
arXiv Detail & Related papers (2024-01-05T07:11:13Z)
In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem. Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
Speaker Re-identification with Speaker Dependent Speech Enhancement [37.33388614967888]
This paper introduces a novel approach that cascades speech enhancement and speaker recognition. The proposed approach is evaluated using the VoxCeleb1 dataset, which aims to assess speaker recognition in real-world situations.
arXiv Detail & Related papers (2020-05-15T23:02:10Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features. We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.