A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
- URL: http://arxiv.org/abs/2010.02477v1
- Date: Tue, 6 Oct 2020 04:51:45 GMT
- Title: A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
- Authors: Youngmoon Jung, Yeunju Choi, Hyungjun Lim, Hoirin Kim
- Abstract summary: A speaker verification (SV) system should be robust to short speech segments, especially in noisy and reverberant environments, and to audio streams containing long non-speech segments.
To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD).
We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner.
- Score: 16.91453126121351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker verification (SV) has recently attracted considerable research
interest due to the growing popularity of virtual assistants. At the same time,
there is an increasing requirement for an SV system: it should be robust to
short speech segments, especially in noisy and reverberant environments. In
this paper, we consider one more important requirement for practical
applications: the system should be robust to an audio stream containing long
non-speech segments, where voice activity detection (VAD) is not applied. To
meet these two requirements, we introduce feature pyramid module (FPM)-based
multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present
the FPM-based MSA to deal with short speech segments in noisy and reverberant
environments. Also, we use the SAS-VAD to increase the robustness to long
non-speech segments. To further improve the robustness to acoustic distortions
(i.e., noise and reverberation), we apply a masking-based speech enhancement
(SE) method. We combine SV, VAD, and SE models in a unified deep learning
framework and jointly train the entire network in an end-to-end manner. To the
best of our knowledge, this is the first work combining these three models in a
deep learning framework. We conduct experiments on Korean indoor (KID) and
VoxCeleb datasets, which are corrupted by noise and reverberation. The results
show that the proposed method is effective for SV under these challenging conditions
and performs better than the baseline i-vector and deep speaker embedding
systems.
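The paper is described here only at the abstract level, so the following PyTorch sketch is a schematic reading of the pipeline rather than the authors' code: all module internals, layer sizes, and feature shapes are assumptions. It wires the three pieces the abstract names end to end, so gradients flow through all of them: a masking-based SE front-end that predicts a ratio mask over the spectrogram, a soft VAD whose per-frame speech posteriors act as weights instead of hard frame decisions, and a speaker network with feature-pyramid-style multi-scale aggregation that fuses a deeper feature map back into a shallower one before pooling.

```python
# Minimal sketch (not the authors' code): a unified SE -> soft-VAD -> SV
# pipeline in PyTorch. Shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskingSE(nn.Module):
    """Masking-based speech enhancement: predict a ratio mask over the
    input features and apply it multiplicatively."""
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_mels)

    def forward(self, x):                # x: (B, T, n_mels)
        h, _ = self.rnn(x)
        m = torch.sigmoid(self.mask(h))  # ratio mask in [0, 1]
        return m * x                     # enhanced features

class SoftVAD(nn.Module):
    """Soft VAD: per-frame speech posteriors used as attention-like
    weights rather than hard frame dropping."""
    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (B, T, n_mels)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))  # (B, T, 1) speech posteriors

class FPMSpeakerNet(nn.Module):
    """Speaker embedding CNN with feature-pyramid-style multi-scale
    aggregation: a deeper map is upsampled and fused with a shallower one."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.stage1 = nn.Conv2d(1, 32, 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.lat1 = nn.Conv2d(32, 64, 1)   # lateral 1x1 conv, as in FPN
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x, w):             # x: (B, T, n_mels), w: (B, T, 1)
        x = (w * x).unsqueeze(1)         # weight frames by speech posterior
        c1 = F.relu(self.stage1(x))
        c2 = F.relu(self.stage2(c1))
        p1 = self.lat1(c1) + F.interpolate(c2, size=c1.shape[-2:])
        pooled = p1.mean(dim=(2, 3))     # global average pooling
        return F.normalize(self.fc(pooled), dim=-1)

# Joint end-to-end use: gradients flow through all three modules.
se, vad, sv = MaskingSE(), SoftVAD(), FPMSpeakerNet()
feats = torch.randn(8, 200, 40)          # batch of log-mel features
enhanced = se(feats)
emb = sv(enhanced, vad(enhanced))        # (8, 256) speaker embeddings
```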
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- CPM: Class-conditional Prompting Machine for Audio-visual Segmentation [17.477225065057993]
Class-conditional Prompting Machine (CPM) improves bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries.
We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2024-07-07T13:20:21Z)
- Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
arXiv Detail & Related papers (2024-03-27T20:52:30Z)
- Unsupervised Audio-Visual Segmentation with Modality Alignment [42.613786372067814]
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound.
Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability.
We propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models.
arXiv Detail & Related papers (2024-03-21T07:56:09Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that allocates the main training parameters to multiple cross-modal attention layers.
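A hypothetical sketch of what one audio-guided cross-modal attention layer could look like (the paper's actual CMFE design is not reproduced here): audio frames act as queries over visual lip features, keeping the fused representation audio-dominant.

```python
# Hypothetical sketch of an audio-guided cross-modal attention layer
# (not the paper's exact CMFE): audio frames query visual lip features.
import torch
import torch.nn as nn

class AudioGuidedCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):    # audio: (B, Ta, D), visual: (B, Tv, D)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual fusion, audio-dominant

layer = AudioGuidedCrossAttention()
a, v = torch.randn(2, 100, 256), torch.randn(2, 25, 256)
print(layer(a, v).shape)                 # torch.Size([2, 100, 256])
```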
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains [0.0]
Voice activity detection (VAD) and overlapped speech detection (OSD) can be trained jointly using a multi-class classification model.
This paper proposes a complete and new benchmark of different VAD and OSD models.
Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, surpass state-of-the-art results.
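As a rough illustration of the joint formulation (architecture details are assumed, not taken from the paper), a TCN-style stack of dilated 1-D convolutions can emit per-frame logits over three classes: non-speech, single-speaker speech, and overlapped speech.

```python
# Illustrative sketch of joint VAD/OSD as 3-class frame classification
# (non-speech / single speaker / overlapped speech); layer sizes assumed.
import torch
import torch.nn as nn

class TCNVadOsd(nn.Module):
    def __init__(self, feat_dim=80, channels=64, n_classes=3):
        super().__init__()
        # Dilated 1-D convolutions give a TCN-style growing receptive field.
        self.tcn = nn.Sequential(
            nn.Conv1d(feat_dim, channels, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.head = nn.Conv1d(channels, n_classes, 1)

    def forward(self, x):                # x: (B, T, feat_dim)
        logits = self.head(self.tcn(x.transpose(1, 2)))
        return logits.transpose(1, 2)    # (B, T, 3) per-frame class logits

model = TCNVadOsd()
frames = torch.randn(4, 500, 80)
print(model(frames).shape)               # torch.Size([4, 500, 3])
```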
arXiv Detail & Related papers (2023-07-24T14:29:21Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower word error rate (WER) with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end automatic speech recognition (ASR) system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
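BPE-dropout itself is straightforward to illustrate: during training, each applicable merge is skipped with some probability, so the same word receives varied subword segmentations. The toy code below is not the paper's implementation, and its merge table is made up for the example.

```python
# Toy illustration of BPE-dropout: each applicable merge is skipped with
# probability p, yielding varied subword segmentations of the same word.
import random

def bpe_dropout(word, merges, p=0.1, rng=random):
    symbols = list(word)
    while True:
        # Find the highest-priority (lowest-rank) applicable merge,
        # dropping each candidate with probability p.
        best = None
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and rng.random() >= p:
                rank = merges[pair]
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:                 # no merge survived: stop
            return symbols
        _, i = best
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

merges = {("l", "o"): 0, ("lo", "w"): 1, ("w", "er"): 2, ("e", "r"): 3}
for _ in range(3):
    print(bpe_dropout("lower", merges, p=0.3))  # e.g. ['low', 'er'] or ['low', 'e', 'r']
```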
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
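In the spirit of such an online distortion module (a hedged sketch, not the PASE+ implementation), a batch of clean waveforms can be contaminated on the fly with additive noise at a random SNR and an optional synthetic reverberation tail.

```python
# Hedged sketch of an online speech distortion module in the spirit of
# PASE+ (not the authors' implementation): clean batches are contaminated
# on the fly with randomly chosen disturbances.
import torch

def distort(wave, noise_bank, snr_db_range=(0, 15), p_reverb=0.5):
    """wave: (B, N) clean waveforms; noise_bank: (M, N) noise clips."""
    B, N = wave.shape
    # Additive noise at a random SNR per utterance.
    noise = noise_bank[torch.randint(len(noise_bank), (B,))]
    snr = torch.empty(B, 1).uniform_(*snr_db_range)
    scale = wave.pow(2).mean(1, keepdim=True).sqrt() / (
        noise.pow(2).mean(1, keepdim=True).sqrt() * 10 ** (snr / 20))
    out = wave + scale * noise
    # Crude "reverb": convolve with a random exponentially decaying tail.
    if torch.rand(()) < p_reverb:
        ir = torch.exp(-torch.linspace(0, 8, 200)) * torch.randn(200)
        out = torch.nn.functional.conv1d(
            out.unsqueeze(1), ir.flip(0).view(1, 1, -1), padding=199
        ).squeeze(1)[:, :N]
    return out

clean = torch.randn(4, 16000)            # 1 s of audio at 16 kHz
noises = torch.randn(10, 16000)
noisy = distort(clean, noises)           # (4, 16000) distorted batch
```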