Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
- URL: http://arxiv.org/abs/2308.06472v1
- Date: Sat, 12 Aug 2023 05:41:15 GMT
- Title: Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
- Authors: Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik
- Abstract summary: We propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder.
Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors.
Experimental results show that our scheme outperforms the state-of-the-art results on the LibriPhrase hard dataset.
- Score: 5.697227044927832
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Spotting user-defined/flexible keywords represented in text frequently uses
an expensive text encoder for joint analysis with an audio encoder in an
embedding space, which can suffer from heterogeneous modality representation
(i.e., large mismatch) and increased complexity. In this work, we propose a
novel architecture to efficiently detect arbitrary keywords based on an
audio-compliant text encoder which inherently has homogeneous representation
with audio embedding, and it is also much smaller than a compatible text
encoder. Our text encoder converts the text to phonemes using a
grapheme-to-phoneme (G2P) model, and then to an embedding using representative
phoneme vectors, extracted from the paired audio encoder on rich speech
datasets. We further augment our method with confusable keyword generation to
develop an audio-text embedding verifier with strong discriminative power.
Experimental results show that our scheme outperforms the state-of-the-art
results on the LibriPhrase hard dataset, increasing the Area Under the ROC
Curve (AUC) from 84.21% to 92.7% and reducing the Equal Error Rate (EER) from
23.36% to 14.4%.
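The abstract's text-encoder pipeline (text → phonemes via G2P → embedding via representative phoneme vectors) can be sketched as follows. This is a toy illustration, not the paper's implementation: the G2P lookup table and the random phoneme vectors below are made-up placeholders, whereas the paper derives its representative phoneme vectors from a paired audio encoder trained on rich speech datasets.

```python
import numpy as np

# Hypothetical grapheme-to-phoneme lookup; a real system would use a
# trained G2P model covering the full vocabulary.
TOY_G2P = {
    "hey": ["HH", "EY"],
    "siri": ["S", "IH", "R", "IY"],
}

# Placeholder representative phoneme vectors (random here; in the paper
# these are extracted from the paired audio encoder).
rng = np.random.default_rng(0)
PHONEMES = ["HH", "EY", "S", "IH", "R", "IY"]
PHONEME_VECTORS = {p: rng.normal(size=8) for p in PHONEMES}

def encode_text(keyword: str) -> np.ndarray:
    """Map a keyword to a (num_phonemes, dim) embedding sequence."""
    phonemes = []
    for word in keyword.lower().split():
        phonemes.extend(TOY_G2P[word])
    return np.stack([PHONEME_VECTORS[p] for p in phonemes])

emb = encode_text("hey siri")
print(emb.shape)  # (6, 8): one 8-dim vector per phoneme
```

Because the embedding is assembled from vectors that live in the audio encoder's space, the resulting text representation is homogeneous with the audio embedding by construction, which is the key idea the abstract describes.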
Related papers
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Bridging Language Gaps in Audio-Text Retrieval [28.829775980536574]
We propose a language enhancement (LE) using a multilingual text encoder (SONAR) to encode the text data with language-specific information.
We optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval.
Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho.
arXiv Detail & Related papers (2024-06-11T07:12:12Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful-decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained on more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - Matching Latent Encoding for Audio-Text based Keyword Spotting [9.599402723927733]
We propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS).
Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence.
Experimental results show that our DSP is more effective than other partitioning schemes.
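The dynamic-programming partitioning idea behind DSP can be sketched generically: split a sequence of N audio frames into M contiguous segments, one per text token, by minimizing a segment cost. The cost used below (within-segment variance) and the helper name `partition_sequence` are illustrative assumptions, not the paper's actual DSP criterion.

```python
import numpy as np

def partition_sequence(frames: np.ndarray, num_segments: int):
    """Split frames (N, D) into num_segments contiguous segments,
    minimizing the total within-segment distance to each segment mean.
    Generic DP illustration; DSP in the paper uses its own cost."""
    n = len(frames)
    # cost[i][j]: cost of grouping frames i..j-1 into one segment
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            seg = frames[i:j]
            cost[i][j] = ((seg - seg.mean(axis=0)) ** 2).sum()
    # dp[k][j]: best cost of splitting the first j frames into k segments
    dp = np.full((num_segments + 1, n + 1), np.inf)
    back = np.zeros((num_segments + 1, n + 1), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + cost[i][j]
                if c < dp[k][j]:
                    dp[k][j] = c
                    back[k][j] = i
    # Recover segment boundaries by walking the backpointers.
    bounds, j = [], n
    for k in range(num_segments, 0, -1):
        i = back[k][j]
        bounds.append((int(i), int(j)))
        j = i
    return bounds[::-1]

# Two clearly separated frame clusters split cleanly into two segments.
frames = np.array([[0.0], [0.1], [0.0], [5.0], [5.1], [5.0]])
print(partition_sequence(frames, 2))  # [(0, 3), (3, 6)]
```

Once the audio sequence is partitioned to the same length as the text sequence, segment-level embeddings can be compared token by token, which is what makes the matching in the paper tractable.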
arXiv Detail & Related papers (2023-06-08T14:44:23Z) - Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method.
BiAM facilitates feature sampling rate exchange, so that the quality of the transformed features of one modality can be measured in the other modality's space.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Acoustic Neighbor Embeddings [2.842794675894731]
This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings.
The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences.
The recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding on test data with up to 1 million names in the vocabulary and 40-dimensional embeddings.
arXiv Detail & Related papers (2020-07-20T05:33:07Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.