Multi-task Learning with Cross Attention for Keyword Spotting
- URL: http://arxiv.org/abs/2107.07634v1
- Date: Thu, 15 Jul 2021 22:38:16 GMT
- Title: Multi-task Learning with Cross Attention for Keyword Spotting
- Authors: Takuya Higuchi, Anmol Gupta, Chandra Dhir
- Abstract summary: Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase.
There is a mismatch between the training criterion (phoneme recognition) and the target task (KWS).
Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data.
- Score: 8.103605110339519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword spotting (KWS) is an important technique for speech applications,
which enables users to activate devices by speaking a keyword phrase. Although
a phoneme classifier can be used for KWS, exploiting a large amount of
transcribed data for automatic speech recognition (ASR), there is a mismatch
between the training criterion (phoneme recognition) and the target task (KWS).
Recently, multi-task learning has been applied to KWS to exploit both ASR and
KWS training data. In this approach, an output of an acoustic model is split
into two branches for the two tasks, one for phoneme transcription trained with
the ASR data and one for keyword classification trained with the KWS data. In
this paper, we introduce a cross attention decoder in the multi-task learning
framework. Unlike the conventional multi-task learning approach with the simple
split of the output layer, the cross attention decoder summarizes information
from a phonetic encoder by performing cross attention between the encoder
outputs and a trainable query sequence to predict a confidence score for the
KWS task. Experimental results on KWS tasks show that the proposed approach
outperformed the conventional multi-task learning with split branches and a
bi-directional long short-term memory decoder by 12% on average.
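A minimal PyTorch sketch of the multi-task setup described in the abstract: a shared phonetic encoder feeds (a) a phoneme-classification branch trained on ASR data and (b) a cross attention decoder in which a trainable query sequence attends over the encoder outputs to produce a keyword confidence score. Layer sizes, the number of queries, and all module names here are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class CrossAttentionKWSModel(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=256, num_phonemes=42, num_queries=4):
        super().__init__()
        # Shared phonetic encoder (stand-in for the acoustic model).
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        # Branch 1: frame-level phoneme logits, trained with ASR transcriptions.
        self.phoneme_head = nn.Linear(enc_dim, num_phonemes)
        # Branch 2: cross attention decoder with a trainable query sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim))
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads=4, batch_first=True)
        self.kws_head = nn.Linear(num_queries * enc_dim, 1)  # keyword confidence

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-mel filterbanks.
        enc_out, _ = self.encoder(feats)                 # (B, T, enc_dim)
        phoneme_logits = self.phoneme_head(enc_out)      # (B, T, num_phonemes)

        # Trainable queries attend over the encoder outputs (cross attention)
        # and summarize them into a fixed-size representation.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)   # (B, Q, enc_dim)
        summary, _ = self.cross_attn(q, enc_out, enc_out)             # (B, Q, enc_dim)
        kws_logit = self.kws_head(summary.flatten(1))                 # (B, 1)
        return phoneme_logits, kws_logit


# Usage: ASR batches supervise the phoneme branch, KWS batches supervise the decoder.
model = CrossAttentionKWSModel()
feats = torch.randn(8, 200, 80)             # 8 utterances, 200 frames each
phoneme_logits, kws_logit = model(feats)
kws_prob = torch.sigmoid(kws_logit)         # confidence that the keyword was spoken
```

In this sketch the two branches share the encoder, so ASR data improves the phonetic representation while the KWS data only updates the query/attention decoder and its head, which is the general multi-task arrangement the abstract describes.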
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z) - Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z) - Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z) - SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z) - Multi-task Learning for Speaker Verification and Voice Trigger Detection [18.51531434428444]
We investigate training a single network to perform both tasks jointly.
We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data.
Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations.
arXiv Detail & Related papers (2020-01-26T21:19:27Z)