Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
- URL: http://arxiv.org/abs/2308.06472v1
- Date: Sat, 12 Aug 2023 05:41:15 GMT
- Title: Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
- Authors: Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik
- Abstract summary: We propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder.
Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors.
Experimental results show that our scheme outperforms state-of-the-art results on the LibriPhrase hard dataset.
- Score: 5.697227044927832
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spotting user-defined/flexible keywords represented in text frequently uses
an expensive text encoder for joint analysis with an audio encoder in an
embedding space, which can suffer from heterogeneous modality representation
(i.e., large mismatch) and increased complexity. In this work, we propose a
novel architecture to efficiently detect arbitrary keywords based on an
audio-compliant text encoder which inherently has homogeneous representation
with audio embedding, and it is also much smaller than a compatible text
encoder. Our text encoder converts the text to phonemes using a
grapheme-to-phoneme (G2P) model, and then to an embedding using representative
phoneme vectors, extracted from the paired audio encoder on rich speech
datasets. We further augment our method with confusable keyword generation to
develop an audio-text embedding verifier with strong discriminative power.
Experimental results show that our scheme outperforms the state-of-the-art
results on the LibriPhrase hard dataset, increasing the Area Under the ROC Curve
(AUC) from 84.21% to 92.7% and reducing the Equal Error Rate (EER) from
23.36% to 14.4%.
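As a rough illustration of this pipeline, the sketch below converts a keyword to phonemes with the open-source g2p_en package and maps each phoneme to a representative vector. In the paper those vectors are extracted from the paired audio encoder over rich speech datasets; random placeholders stand in here, and the confusion map for generating confusable keywords is likewise invented for illustration, not the authors' code.

```python
# Minimal sketch of an audio-compliant text encoder, assuming the open-source
# g2p_en package (pip install g2p-en) for grapheme-to-phoneme conversion.
# Representative phoneme vectors are random placeholders; the paper extracts
# them from the paired audio encoder.
import numpy as np
from g2p_en import G2p

EMB_DIM = 256
g2p = G2p()
phoneme_table: dict[str, np.ndarray] = {}  # phoneme -> representative vector

def phoneme_vector(ph: str) -> np.ndarray:
    """Look up (or lazily create) the representative vector for a phoneme."""
    if ph not in phoneme_table:
        phoneme_table[ph] = np.random.randn(EMB_DIM).astype(np.float32)
    return phoneme_table[ph]

def encode_text(keyword: str) -> np.ndarray:
    """Text -> phoneme sequence -> (T, EMB_DIM) embedding in 'audio space'."""
    phonemes = [p for p in g2p(keyword) if p.strip()]
    return np.stack([phoneme_vector(p) for p in phonemes])

# Confusable keyword generation, sketched as single-phoneme substitution with
# acoustically similar neighbours; this confusion map is an invented example.
CONFUSION = {"P": ["B"], "T": ["D", "K"], "S": ["Z", "SH"], "M": ["N"]}

def confusable_variants(keyword: str):
    base = [p for p in g2p(keyword) if p.strip()]
    for i, ph in enumerate(base):
        for alt in CONFUSION.get(ph, []):
            yield base[:i] + [alt] + base[i + 1:]

print(encode_text("weather forecast").shape)   # (num_phonemes, 256)
print(next(confusable_variants("tea"), None))  # e.g. ['D', 'IY1']
```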
Related papers
- Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words [10.2138250640885]
We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords in text prompts.
We adopt a decoder-only architecture and use our in-house LLM, PLaMo-100B, pre-trained from scratch on datasets dominated by Japanese and English text, as the decoder.
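A minimal sketch of the keyword-prompting idea this summary describes; the template, function name, and audio placeholder are illustrative assumptions, not the paper's actual interface to PLaMo-100B.

```python
# Hypothetical prompt construction for a decoder-only LLM ASR system that is
# contextualized by keywords; the template is an assumption for illustration.
def build_prompt(keywords, audio_placeholder="<audio>"):
    """Prepend rare/ambiguous keywords so the decoder can bias toward them."""
    hint = ", ".join(keywords)
    return (f"Keywords that may appear: {hint}.\n"
            f"Transcribe the following speech: {audio_placeholder}")

print(build_prompt(["Shinkansen", "Kaggle", "PLaMo"]))
```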
arXiv Detail & Related papers (2024-08-15T08:50:58Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Bridging Language Gaps in Audio-Text Retrieval [28.829775980536574]
We propose a language enhancement (LE) using a multilingual text encoder (SONAR) to encode the text data with language-specific information.
We optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval.
Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho.
arXiv Detail & Related papers (2024-06-11T07:12:12Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
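The splicing idea lends itself to a short sketch: concatenate word-level clips from monolingual corpora, with a small crossfade, to form a code-switched utterance. The crossfade length and helper names are assumptions, not the paper's implementation.

```python
# Rough sketch of audio splicing for code-switched data synthesis; real
# systems also smooth energy and pick segments by alignment, omitted here.
import numpy as np

def collage(segments: list[np.ndarray], xfade: int = 160) -> np.ndarray:
    """Concatenate waveforms with a short linear crossfade (xfade samples)."""
    out = segments[0]
    ramp = np.linspace(0.0, 1.0, xfade)
    for seg in segments[1:]:
        head = out[:-xfade]
        mix = out[-xfade:] * (1 - ramp) + seg[:xfade] * ramp
        out = np.concatenate([head, mix, seg[xfade:]])
    return out

# Usage: pick word-level clips for the mixed-language sequence, then splice.
en_word = np.random.randn(8000)   # stand-in for an English word clip @16 kHz
es_word = np.random.randn(9600)   # stand-in for a Spanish word clip
cs_utterance = collage([en_word, es_word, en_word])
```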
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- Matching Latent Encoding for Audio-Text based Keyword Spotting [9.599402723927733]
We propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS).
Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence.
Experimental results show that our DSP is more effective than other partitioning schemes.
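One plausible reading of such a partitioning algorithm is a dynamic program that splits T audio frames into N contiguous segments, one per text token, maximizing segment-to-token similarity. The scoring function and cubic-time formulation below are assumptions, not the paper's exact method.

```python
# Sketch of a DP that partitions an audio embedding sequence into as many
# contiguous segments as there are text tokens, maximizing cosine similarity
# between each segment's mean embedding and its token embedding.
import numpy as np

def dsp_partition(audio, text):
    """audio: (T, D) frame embeddings; text: (N, D) token embeddings.
    Returns N segment end indices (exclusive)."""
    T, N = len(audio), len(text)
    prefix = np.cumsum(audio, axis=0)  # prefix sums for fast segment means

    def score(n, s, e):  # cosine(mean of frames [s, e), token n)
        seg = (prefix[e - 1] - (prefix[s - 1] if s else 0)) / (e - s)
        return float(seg @ text[n]) / (
            np.linalg.norm(seg) * np.linalg.norm(text[n]) + 1e-8)

    NEG = -1e9
    dp = np.full((N + 1, T + 1), NEG)
    dp[0][0] = 0.0
    back = np.zeros((N + 1, T + 1), dtype=int)
    for n in range(1, N + 1):
        # leave at least one frame for each of the remaining N - n segments
        for t in range(n, T - (N - n) + 1):
            for s in range(n - 1, t):
                cand = dp[n - 1][s] + score(n - 1, s, t)
                if cand > dp[n][t]:
                    dp[n][t], back[n][t] = cand, s
    ends, t = [], T
    for n in range(N, 0, -1):
        ends.append(t)
        t = back[n][t]
    return ends[::-1]

print(dsp_partition(np.random.randn(20, 8), np.random.randn(3, 8)))
```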
arXiv Detail & Related papers (2023-06-08T14:44:23Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn an ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space.
Experimental results on the LibriSpeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
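A hedged sketch of what such a bidirectional attention module could look like, using standard PyTorch cross-attention in each direction; dimensions, the class name, and the suggested matching loss are illustrative assumptions, not the paper's architecture.

```python
# Cross-attention in both directions between speech-rate and text-rate
# features, so each modality can be represented at the other's sampling rate.
import torch
import torch.nn as nn

class BiAM(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.s2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, speech, text):
        # text queries attend over speech frames -> text-rate "speech" features
        text_like, _ = self.s2t(text, speech, speech)
        # speech queries attend over text tokens -> speech-rate "text" features
        speech_like, _ = self.t2s(speech, text, text)
        return text_like, speech_like

biam = BiAM()
speech = torch.randn(2, 300, 256)  # (batch, frames, dim)
text = torch.randn(2, 20, 256)     # (batch, tokens, dim)
text_like, speech_like = biam(speech, text)
# A matching loss could then compare text_like with text (and speech_like
# with speech) to align the two embedding spaces.
print(text_like.shape, speech_like.shape)  # (2, 20, 256) (2, 300, 256)
```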
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
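Pseudo codes of this kind are typically obtained by clustering frame-level speech features into discrete acoustic units, HuBERT-style. The sketch below shows that step with scikit-learn k-means; the feature type and cluster count are assumptions, and the paper's exact recipe may differ.

```python
# Deriving discrete "pseudo codes" from speech features via k-means; these
# unit sequences can then serve as targets for decoder pre-training.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(5000, 39)    # stand-in for MFCC-like frame features
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames)
pseudo_codes = kmeans.predict(frames)  # one discrete unit per frame

# Collapse consecutive repeats so the decoder predicts unit *sequences*.
dedup = [int(c) for i, c in enumerate(pseudo_codes)
         if i == 0 or c != pseudo_codes[i - 1]]
print(dedup[:20])
```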
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)