Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
- URL: http://arxiv.org/abs/2206.15400v2
- Date: Fri, 1 Jul 2022 06:42:55 GMT
- Title: Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
- Authors: Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung and Hong-Goo Kang
- Abstract summary: We propose a novel end-to-end user-defined keyword spotting method.
Our method compares input queries with an enrolled text keyword sequence.
We introduce the LibriPhrase dataset for efficiently training keyword spotting models.
- Score: 23.627625026135505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel end-to-end user-defined keyword spotting
method that utilizes linguistically corresponding patterns between speech and
text sequences. Unlike previous approaches requiring speech keyword enrollment,
our method compares input queries with an enrolled text keyword sequence. To
place the audio and text representations within a common latent space, we adopt
an attention-based cross-modal matching approach that is trained in an
end-to-end manner with monotonic matching loss and keyword classification loss.
We also utilize a de-noising loss for the acoustic embedding network to improve
robustness in noisy environments. Additionally, we introduce the LibriPhrase
dataset, a new short-phrase dataset based on LibriSpeech for efficiently
training keyword spotting models. Our proposed method achieves competitive
results on various evaluation sets compared to other single-modal and
cross-modal baselines.
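A minimal PyTorch sketch of the cross-modal matching idea described in the abstract: a text encoder embeds the enrolled keyword, an audio encoder embeds the spoken query, attention between the two modalities produces an alignment matrix, and training combines a keyword classification loss, a monotonicity penalty on the attention, and a de-noising loss on the acoustic branch. The module layout, the soft-diagonal form of the monotonic matching loss, and the L1 de-noising target are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens):                     # tokens: (B, L_t)
        out, _ = self.rnn(self.embed(tokens))      # (B, L_t, dim)
        return out

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim // 2, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.denoise_head = nn.Linear(dim, n_mels)  # predicts clean features (assumption)

    def forward(self, mels):                       # mels: (B, L_a, n_mels)
        out, _ = self.rnn(mels)                    # (B, L_a, dim)
        return out, self.denoise_head(out)

class CrossModalKWS(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.text_enc = TextEncoder(vocab_size, dim)
        self.audio_enc = AudioEncoder(dim=dim)
        self.scale = dim ** -0.5
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, 1))

    def forward(self, mels, tokens):
        T = self.text_enc(tokens)                          # (B, L_t, d)
        A, clean_pred = self.audio_enc(mels)               # (B, L_a, d)
        attn = torch.softmax(T @ A.transpose(1, 2) * self.scale, dim=-1)
        aligned = attn @ A                                 # audio summarised per text token
        pooled = torch.cat([T, aligned], dim=-1).mean(dim=1)
        logit = self.classifier(pooled).squeeze(-1)        # keyword present / absent
        return logit, attn, clean_pred

def monotonic_matching_loss(attn):
    # Assumption: a guided-attention-style penalty that pushes attention mass
    # toward a monotonic (soft-diagonal) alignment; the paper's exact loss may differ.
    B, L_t, L_a = attn.shape
    t = torch.arange(L_t, device=attn.device).view(1, L_t, 1) / max(L_t - 1, 1)
    a = torch.arange(L_a, device=attn.device).view(1, 1, L_a) / max(L_a - 1, 1)
    penalty = 1.0 - torch.exp(-((t - a) ** 2) / 0.08)
    return (attn * penalty).mean()

# One hypothetical training step on random tensors:
model = CrossModalKWS(vocab_size=40)
mels_noisy = torch.randn(2, 120, 80)      # noisy input features
mels_clean = torch.randn(2, 120, 80)      # clean targets for the de-noising loss
tokens = torch.randint(0, 40, (2, 12))    # enrolled keyword text
labels = torch.tensor([1.0, 0.0])         # query matches the enrolled keyword or not

logit, attn, clean_pred = model(mels_noisy, tokens)
loss = (F.binary_cross_entropy_with_logits(logit, labels)
        + monotonic_matching_loss(attn)
        + F.l1_loss(clean_pred, mels_clean))
loss.backward()
```

Because the keyword is enrolled as text only, adding a new keyword requires no recorded speech examples, which is the open-vocabulary property the abstract emphasizes.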
Related papers
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at that frame using connectionist temporal classification (CTC).
We then aggregate the frame-level acoustic embeddings (AE) to obtain a higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text.
arXiv Detail & Related papers (2024-06-12T06:44:40Z)
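A hedged sketch of the aggregation step in the CTC-aligned entry above: per-frame CTC posteriors are collapsed into character segments (here with a greedy best path rather than the paper's optimal per-frame alignment), and the frame-level acoustic embeddings inside each segment are mean-pooled into character-level embeddings that can then be scored against the keyword's text embeddings. Tensor shapes and the pooling choice are illustrative assumptions.

```python
import torch

def greedy_ctc_segments(log_probs, blank=0):
    """Best-path CTC labels per frame -> (char_id, start, end) segments."""
    path = log_probs.argmax(dim=-1)               # (L_a,)
    segments, prev = [], None
    for t, c in enumerate(path.tolist()):
        if c != blank and c == prev:
            segments[-1][2] = t + 1               # extend the current segment
        elif c != blank:
            segments.append([c, t, t + 1])        # start a new character segment
        prev = c
    return segments

def aggregate_char_embeddings(frame_ae, segments):
    """Mean-pool the frame-level AEs inside each character segment."""
    return torch.stack([frame_ae[s:e].mean(dim=0) for _, s, e in segments])

# Hypothetical tensors: 100 frames, 29-symbol CTC output, 128-dim frame AEs.
log_probs = torch.randn(100, 29).log_softmax(dim=-1)
frame_ae = torch.randn(100, 128)
segments = greedy_ctc_segments(log_probs)
char_ae = aggregate_char_embeddings(frame_ae, segments)   # (n_chars, 128)
# char_ae can now be compared with the text embeddings of the enrolled keyword.
```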
- Relational Proxy Loss for Audio-Text based Keyword Spotting [8.932603220365793]
This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings.
By incorporating RPL, we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
arXiv Detail & Related papers (2024-06-08T01:21:17Z)
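The Relational Proxy Loss entry above only states that structural relations within the acoustic and text embedding spaces are exploited; one generic way to realize that idea is to make the pairwise similarity structure of the two spaces agree. The sketch below shows that formulation as an assumption, not RPL's actual definition.

```python
import torch
import torch.nn.functional as F

def relational_consistency_loss(acoustic_emb, text_emb):
    """Generic sketch: make the pairwise similarity structure among acoustic
    embeddings agree with the structure among the paired text embeddings."""
    a = F.normalize(acoustic_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return F.mse_loss(a @ a.t(), t @ t.t())

# Hypothetical paired embeddings for a batch of 16 keywords.
acoustic_emb = torch.randn(16, 256)
text_emb = torch.randn(16, 256)
loss = relational_consistency_loss(acoustic_emb, text_emb)
```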
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
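A small sketch of the corruption step described in the DenoSent entry above: discrete noise (token deletion or replacement) and continuous noise (Gaussian perturbation of embeddings) are applied to a sentence, and a model would then be trained to restore the original. Noise rates and the exact corruption operations are assumptions.

```python
import random
import torch

def add_discrete_noise(tokens, vocab_size, p=0.15):
    """Discrete noise: randomly delete or replace tokens of the input sentence."""
    noisy = []
    for t in tokens:
        r = random.random()
        if r < p / 2:
            continue                                                  # deletion
        noisy.append(random.randrange(vocab_size) if r < p else t)    # replacement or keep
    return noisy or tokens

def add_continuous_noise(embeddings, sigma=0.1):
    """Continuous noise: perturb the token embeddings with Gaussian noise."""
    return embeddings + sigma * torch.randn_like(embeddings)

# Hypothetical usage; an encoder-decoder is then trained to restore the
# original sentence, and its encoder states serve as sentence embeddings.
tokens = [12, 7, 55, 3, 90]
noisy_tokens = add_discrete_noise(tokens, vocab_size=100)
emb = torch.randn(len(noisy_tokens), 64)
noisy_emb = add_continuous_noise(emb)
```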
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
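A sketch of the keyword-conditioned normalization described in the AdaKWS entry above: audio features are instance-normalized and then rescaled with scale/shift parameters predicted from a text (keyword) embedding. The surrounding encoders and the detection head are omitted, and the specific layer layout is an assumption.

```python
import torch
import torch.nn as nn

class KeywordAdaIN(nn.Module):
    """Instance-normalize audio features, then scale/shift them with
    keyword-conditioned parameters predicted from a text embedding."""
    def __init__(self, text_dim, feat_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * feat_dim)

    def forward(self, audio_feats, keyword_emb):
        # audio_feats: (B, L, feat_dim); keyword_emb: (B, text_dim)
        mean = audio_feats.mean(dim=1, keepdim=True)
        std = audio_feats.std(dim=1, keepdim=True) + 1e-5
        normed = (audio_feats - mean) / std
        gamma, beta = self.to_gamma_beta(keyword_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * normed + beta.unsqueeze(1)

# Hypothetical shapes; keyword_emb would come from any text/keyword encoder.
adain = KeywordAdaIN(text_dim=256, feat_dim=128)
audio_feats = torch.randn(4, 200, 128)
keyword_emb = torch.randn(4, 256)
conditioned = adain(audio_feats, keyword_emb)   # fed to a binary keyword detector
```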
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
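A sketch of a frame-level contrastive objective in the spirit of the CTAP entry above, which brings phoneme and speech representations into a joint space: assuming the two encoders produce frame-aligned sequences, a symmetric InfoNCE loss treats the embeddings at the same time step as the positive pair. The alignment assumption and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(speech_emb, phoneme_emb, temperature=0.1):
    """Symmetric InfoNCE over time steps: the t-th speech frame should match
    the t-th (frame-aligned) phoneme embedding and no other position."""
    s = F.normalize(speech_emb, dim=-1)           # (L, d)
    p = F.normalize(phoneme_emb, dim=-1)          # (L, d)
    logits = s @ p.t() / temperature              # (L, L) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical frame-aligned outputs of a speech encoder and a phoneme encoder.
speech_emb = torch.randn(150, 256)
phoneme_emb = torch.randn(150, 256)
loss = frame_level_contrastive_loss(speech_emb, phoneme_emb)
```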
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
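A sketch of the acoustic word embedding idea from the entry above: a recurrent encoder maps a variable-length word segment to a fixed-size AWE, a second encoder maps the written word to an AGWE, and a multi-view triplet objective pulls matching pairs together. The batch-hard negative mining and margin are common choices here, not necessarily the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """BiGRU that maps a variable-length input sequence to a fixed-size embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, emb_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):                              # (B, L, in_dim)
        out, _ = self.rnn(x)
        return F.normalize(out.mean(dim=1), dim=-1)

# Acoustic view: log-mel frames of a word segment -> AWE.
# Written view:  character embeddings of the word   -> AGWE.
awe_enc = SegmentEncoder(in_dim=80)
agwe_enc = SegmentEncoder(in_dim=32)

awe = awe_enc(torch.randn(8, 60, 80))
agwe = agwe_enc(torch.randn(8, 7, 32))

# Multi-view triplet objective: the AWE of word i should be closer to its own
# AGWE than to the hardest other word's AGWE in the batch.
pos = (awe * agwe).sum(dim=-1)
neg = (awe @ agwe.t()).masked_fill(torch.eye(8, dtype=torch.bool), float('-inf')).max(dim=-1).values
loss = F.relu(0.5 + neg - pos).mean()
```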
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
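The SEND entry above predicts power-set encoded labels; the snippet below shows one common way to build such an encoding, mapping every allowed subset of simultaneously active speakers to a single class index. The maximum overlap of two speakers is an assumption.

```python
from itertools import combinations

def build_powerset_mapping(num_speakers, max_overlap=2):
    """Map each allowed subset of active speakers to one class index,
    so overlapping speech becomes a single multi-class label per frame."""
    subsets = [()]
    for k in range(1, max_overlap + 1):
        subsets += list(combinations(range(num_speakers), k))
    return {s: i for i, s in enumerate(subsets)}

mapping = build_powerset_mapping(num_speakers=3)
# e.g. a frame where speakers 0 and 2 talk simultaneously:
label = mapping[(0, 2)]
print(mapping)   # {(): 0, (0,): 1, (1,): 2, (2,): 3, (0, 1): 4, (0, 2): 5, (1, 2): 6}
```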
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
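A sketch of the similarity-map idea in the KWS-Net entry above: a cosine-similarity map between keyword character embeddings and per-frame embeddings is treated as an image, and a small CNN decides whether the keyword's characteristic pattern is present. The CNN depth and pooling are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMapDetector(nn.Module):
    """Build a (keyword length x time) cosine-similarity map between keyword
    character embeddings and per-frame embeddings, then let a small CNN detect
    whether the keyword's pattern (roughly a diagonal streak) is present."""
    def __init__(self, channels=16):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(channels, 1))

    def forward(self, frame_emb, char_emb):
        # frame_emb: (B, L_a, d), char_emb: (B, L_k, d)
        sim = torch.einsum('bkd,btd->bkt',
                           F.normalize(char_emb, dim=-1),
                           F.normalize(frame_emb, dim=-1))
        return self.cnn(sim.unsqueeze(1)).squeeze(-1)   # keyword-present logit

# Hypothetical per-frame and per-character embeddings.
detector = SimilarityMapDetector()
logit = detector(torch.randn(2, 120, 256), torch.randn(2, 8, 256))
```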
- Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection [17.54377669932433]
We propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection.
We combine audio data in two languages for training instead of using only a single language.
arXiv Detail & Related papers (2020-05-24T15:27:56Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
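A sketch of the triplet objective in the unsupervised cross-modal audio entry above: an audio track's embedding is pulled toward a track that the accompanying text marks as related and pushed away from an unrelated one. The cosine distance and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def track_relatedness_triplet_loss(anchor, related, unrelated, margin=0.2):
    """Triplet objective: an audio track's embedding should be closer to a track
    that the accompanying text marks as related than to an unrelated one."""
    d_pos = 1 - F.cosine_similarity(anchor, related, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, unrelated, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical embeddings produced by a shared audio encoder.
anchor, related, unrelated = (torch.randn(16, 256) for _ in range(3))
loss = track_relatedness_triplet_loss(anchor, related, unrelated)
```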
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.