Multi-Task Network for Noise-Robust Keyword Spotting and Speaker
Verification using CTC-based Soft VAD and Global Query Attention
- URL: http://arxiv.org/abs/2005.03867v4
- Date: Fri, 7 Aug 2020 07:23:57 GMT
- Title: Multi-Task Network for Noise-Robust Keyword Spotting and Speaker
Verification using CTC-based Soft VAD and Global Query Attention
- Authors: Myunghun Jung, Youngmoon Jung, Jahyun Goo, and Hoirin Kim
- Abstract summary: Keyword spotting (KWS) and speaker verification (SV) have been studied independently, although the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
- Score: 13.883985850789443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword spotting (KWS) and speaker verification (SV) have been studied
independently, although it is known that the acoustic and speaker domains are
complementary. In this paper, we propose a multi-task network that performs KWS
and SV simultaneously to fully utilize the interrelated domain information. The
multi-task network tightly combines sub-networks aiming at performance
improvement in challenging conditions such as noisy environments,
open-vocabulary KWS, and short-duration SV, by introducing novel techniques of
connectionist temporal classification (CTC)-based soft voice activity detection
(VAD) and global query attention. Frame-level acoustic and speaker information
is integrated with phonetically originated weights so that it forms a word-level
global representation, which is then used for the aggregation of feature vectors
to generate discriminative embeddings. Our proposed approach shows 4.06% and
26.71% relative improvements in equal error rate (EER) compared to the
baselines for both tasks. We also present a visualization example and results
of ablation experiments.
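The pipeline described in the abstract can be pictured with a short sketch: CTC posteriors supply per-frame speech weights (soft VAD), a VAD-weighted average forms the word-level global query, and attention under that query aggregates frames into an embedding. The tensor shapes, blank index, and single-query form below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def ctc_soft_vad_weights(ctc_log_probs: torch.Tensor, blank: int = 0) -> torch.Tensor:
    """Per-frame speech weights from CTC posteriors: 1 - P(blank).

    ctc_log_probs: [T, V] log-softmax outputs of a CTC acoustic model.
    """
    speech_prob = 1.0 - ctc_log_probs.exp()[:, blank]            # [T]
    return speech_prob / speech_prob.sum().clamp(min=1e-8)

def global_query_attention(frames: torch.Tensor, vad_weights: torch.Tensor) -> torch.Tensor:
    """Aggregate frame features with a word-level global query.

    frames: [T, D] frame-level features; vad_weights: [T].
    """
    # Word-level global representation: VAD-weighted average of frames.
    global_query = (vad_weights.unsqueeze(1) * frames).sum(dim=0)  # [D]
    # Scaled dot-product attention scores against the global query.
    scores = frames @ global_query / frames.size(1) ** 0.5         # [T]
    alphas = F.softmax(scores, dim=0)
    return (alphas.unsqueeze(1) * frames).sum(dim=0)               # [D] embedding

T, V, D = 120, 40, 256
log_probs = F.log_softmax(torch.randn(T, V), dim=-1)   # stand-in CTC outputs
frames = torch.randn(T, D)                             # stand-in encoder features
embedding = global_query_attention(frames, ctc_soft_vad_weights(log_probs))
```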
Related papers
- QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
The multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
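A loose sketch of the global-to-local mechanism: a stable clip-level feature is vector-quantized against a learned codebook and distilled into the frame-level features. The codebook size, mean pooling, and loss placement below are illustrative assumptions, not QDFormer's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalToLocalDistiller(nn.Module):
    """Quantize a clip-level feature and distill it into frame-level ones."""
    def __init__(self, dim: int = 256, codebook_size: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [T, D]; stable global feature = mean over the clip.
        global_feat = frame_feats.mean(dim=0)                              # [D]
        # Nearest codebook entry (vector quantization).
        dists = torch.cdist(global_feat.unsqueeze(0), self.codebook.weight)  # [1, K]
        code = self.codebook(dists.argmin(dim=1)).squeeze(0)               # [D]
        # Distillation: pull each frame toward the (detached) quantized global.
        return F.mse_loss(frame_feats, code.detach().expand_as(frame_feats))
```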
arXiv Detail & Related papers (2023-09-29T20:48:44Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations.
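A minimal sketch of the setup described above: windowed raw audio passes through 1-D convolutions, and each frame is assigned to a small set of phoneme-like units by nearest-centroid lookup. Real HUC training alternates cluster re-estimation with model updates, which is omitted here; layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class HUCSketch(nn.Module):
    """1-D conv encoder whose outputs are clustered into phoneme-like units."""
    def __init__(self, dim: int = 128, num_units: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # Cluster centroids acting as a small set of phoneme-like units.
        self.centroids = nn.Parameter(torch.randn(num_units, dim))

    def forward(self, audio: torch.Tensor):
        # audio: [B, 1, samples] windowed raw waveform.
        feats = self.encoder(audio).transpose(1, 2)                # [B, T, D]
        # Pseudo-labels: index of the nearest centroid per frame.
        dists = torch.cdist(feats, self.centroids.expand(feats.size(0), -1, -1))
        return feats, dists.argmin(dim=-1)                         # [B, T]
```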
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR).
Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
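A toy rendering of the two components: a modality-level interaction that fuses global audio and visual vectors, and a frame-level alignment loss enforcing A-V temporal consistency. The gating and loss choices are assumptions for illustration, not the GILA architecture.

```python
import torch
import torch.nn.functional as F

def gila_losses(audio: torch.Tensor, visual: torch.Tensor):
    """audio, visual: [T, D] synchronized frame-level features."""
    # Global interaction: modality-level vectors mixed by a learned-style
    # similarity gate (illustrative only).
    a_glob, v_glob = audio.mean(dim=0), visual.mean(dim=0)
    gate = torch.sigmoid((a_glob * v_glob).sum() / audio.size(1) ** 0.5)
    fused = gate * a_glob + (1 - gate) * v_glob            # complementary fusion
    # Local alignment: encourage frame-wise A-V temporal consistency.
    align_loss = 1 - F.cosine_similarity(audio, visual, dim=1).mean()
    return fused, align_loss
```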
arXiv Detail & Related papers (2023-05-16T06:41:25Z)
- Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
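The summary leaves the decoupling mechanism implicit; given the title, a natural reading is an orthogonality penalty between the two branches' features. A minimal sketch applied to batch embeddings follows (whether the paper constrains weights or embeddings is not stated here).

```python
import torch
import torch.nn as nn

def orthogonality_penalty(e_kws: torch.Tensor, e_sv: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the cross-branch Gram matrix.

    e_kws, e_sv: [B, D] batch embeddings from the KWS and SV branches.
    Driving E_kws^T E_sv toward zero pushes the two branches to carry
    decoupled (phonetic vs. speaker) information.
    """
    e_kws = nn.functional.normalize(e_kws, dim=1)
    e_sv = nn.functional.normalize(e_sv, dim=1)
    return (e_kws.transpose(0, 1) @ e_sv).pow(2).sum()

# total_loss = kws_loss + sv_loss + lambda_orth * orthogonality_penalty(e_k, e_s)
```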
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
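A dense toy version of the message-passing split (temporal context vs. same-frame candidate context) is sketched below; real iGNN blocks operate on explicit graphs with learned edge functions, and the layer shapes here are assumptions.

```python
import torch
import torch.nn as nn

class SplitMessagePassing(nn.Module):
    """One iGNN-style block: separate temporal and cross-candidate messages."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.temporal = nn.Linear(dim, dim)   # context along time
        self.speakers = nn.Linear(dim, dim)   # context across candidates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, S, D] features for S candidate speakers over T frames.
        t_msg = self.temporal(x.mean(dim=0, keepdim=True))   # time context
        s_msg = self.speakers(x.mean(dim=1, keepdim=True))   # same-frame context
        return torch.relu(x + t_msg + s_msg)                 # aggregate & update
```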
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
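The summary does not spell out the single-label reformulation; power-set encoding of speaker-activity patterns is one standard way to achieve it and is used below purely as an illustration.

```python
import torch

def powerset_label(active: torch.Tensor) -> torch.Tensor:
    """Encode a multi-hot speaker-activity vector as one class index.

    active: [T, S] binary matrix (speaker s active at frame t).
    Each of the 2**S overlap patterns becomes a single label, so
    overlapped diarization turns into ordinary single-label prediction.
    """
    weights = 2 ** torch.arange(active.size(1))        # [S]
    return (active.long() * weights).sum(dim=1)        # [T], values in [0, 2**S)

labels = powerset_label(torch.tensor([[1, 0, 0], [1, 1, 0], [0, 0, 0]]))
# -> tensor([1, 3, 0]); a frame-level classifier then predicts these classes.
```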
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments [16.91453126121351]
A speaker verification (SV) system should be robust to short speech segments, especially in noisy and reverberant environments.
To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD).
We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner.
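A generic form of soft VAD weighting, where predicted speech posteriors gate frame features before pooling, can be sketched as follows; the head architecture is assumed, and the paper's self-adaptive variant adds adaptation not shown here.

```python
import torch
import torch.nn as nn

class SoftVADPooling(nn.Module):
    """Weight frame features by predicted speech posteriors before pooling."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.vad_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [T, D], e.g. enhanced features from an upstream SE model.
        w = self.vad_head(frames)              # [T, 1] soft speech posterior
        w = w / w.sum().clamp(min=1e-8)
        return (w * frames).sum(dim=0)         # utterance-level embedding
```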
arXiv Detail & Related papers (2020-10-06T04:51:45Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
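The pre-train-on-VoxCeleb, fine-tune-high-layers-on-CRSS-Forensics recipe corresponds to standard partial fine-tuning; a minimal sketch, with the split point and optimizer settings assumed, is:

```python
import torch.nn as nn
from torch.optim import Adam

def finetune_high_layers(model: nn.Sequential, n_trainable: int, lr: float = 1e-4):
    """Freeze early layers; fine-tune only the last n_trainable modules.

    Assumes n_trainable >= 1; the split point is an illustrative choice.
    """
    modules = list(model.children())
    for m in modules[:-n_trainable]:           # keep low-level features fixed
        for p in m.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return Adam(trainable, lr=lr)              # optimizer over high layers only
```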
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
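The core idea, predicting per-frame activity for each enrolled speaker, can be sketched by conditioning frame features on target-speaker embeddings; the scoring head below is an assumption, and the actual TS-VAD model adds speaker-combination layers and recurrent context.

```python
import torch
import torch.nn as nn

class TSVADHead(nn.Module):
    """Predict per-frame activity for each of S enrolled speakers."""
    def __init__(self, feat_dim: int = 256, spk_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim + spk_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frames: torch.Tensor, spk_embs: torch.Tensor) -> torch.Tensor:
        # frames: [T, F]; spk_embs: [S, E] (e.g. i-vectors of target speakers).
        T, S = frames.size(0), spk_embs.size(0)
        pairs = torch.cat(
            [frames.unsqueeze(1).expand(T, S, -1),
             spk_embs.unsqueeze(0).expand(T, S, -1)], dim=-1)  # [T, S, F+E]
        return torch.sigmoid(self.scorer(pairs)).squeeze(-1)   # [T, S] activities
```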
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)