Matching Latent Encoding for Audio-Text based Keyword Spotting
- URL: http://arxiv.org/abs/2306.05245v1
- Date: Thu, 8 Jun 2023 14:44:23 GMT
- Title: Matching Latent Encoding for Audio-Text based Keyword Spotting
- Authors: Kumari Nishu, Minsik Cho, Devang Naik
- Abstract summary: We propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS).
Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence.
Experimental results show that our DSP is more effective than other partitioning schemes.
- Score: 9.599402723927733
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown
high-quality results, but the key challenge of how to semantically align two
embeddings for multi-word keywords of different sequence lengths remains
largely unsolved. In this paper, we propose an audio-text-based end-to-end
model architecture for flexible keyword spotting (KWS), which builds upon
learned audio and text embeddings. Our architecture uses a novel dynamic
programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally
partition the audio sequence into the same length as the word-based text
sequence using the monotonic alignment of spoken content. Our proposed model
consists of an encoder block to get audio and text embeddings, a projector
block to project individual embeddings to a common latent space, and an
audio-text aligner containing a novel DSP algorithm, which aligns the audio and
text embeddings to determine if the spoken content is the same as the text.
Experimental results show that our DSP is more effective than other
partitioning schemes, and the proposed architecture outperformed the
state-of-the-art results on the public dataset in terms of Area Under the ROC
Curve (AUC) and Equal-Error-Rate (EER) by 14.4% and 28.9%, respectively.
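As a concrete illustration, the DSP step can be written as a compact dynamic program. Below is a minimal NumPy sketch, assuming mean pooling within each segment and cosine similarity as the matching score; the function names and the aggregation choice are ours for illustration and may differ from the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dsp_align(audio, text):
    """Partition `audio` (T x D frame embeddings) into K contiguous,
    monotonically ordered segments, one per word embedding in `text`
    (K x D), maximizing the summed segment/word similarity.
    Returns (best_score, interior_cut_points)."""
    T, K = len(audio), len(text)
    assert T >= K, "need at least one frame per word"
    prefix = np.cumsum(audio, axis=0)            # prefix sums for O(1) segment means

    def seg_mean(s, t):                          # mean of frames s..t-1
        total = prefix[t - 1] - (prefix[s - 1] if s > 0 else 0)
        return total / (t - s)

    NEG = -1e9                                   # "impossible" sentinel
    dp = np.full((K + 1, T + 1), NEG)            # dp[k][t]: best score using the
    dp[0][0] = 0.0                               # first t frames for the first k words
    back = np.zeros((K + 1, T + 1), dtype=int)
    for k in range(1, K + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):            # last segment covers frames s..t-1
                if dp[k - 1][s] <= NEG / 2:      # skip unreachable prefix states
                    continue
                score = dp[k - 1][s] + cosine(seg_mean(s, t), text[k - 1])
                if score > dp[k][t]:
                    dp[k][t], back[k][t] = score, s
    cuts, t = [], T                              # recover cut points from back pointers
    for k in range(K, 0, -1):
        t = back[k][t]
        cuts.append(t)
    return dp[K][T], cuts[::-1][1:]              # drop the leading 0
```

The cubic loop is only meant to expose the recurrence; the prefix sums keep each segment mean O(1), and monotonicity of spoken content is what licenses the contiguous-segment formulation.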
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at that frame using connectionist temporal classification (CTC).
It then aggregates the frame-level acoustic embeddings (AE) into higher-level (i.e., character, word, or phrase) AEs that align with the text embedding (TE) of the target keyword text.
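The aggregation step can be sketched as below, substituting a greedy CTC best path for the paper's per-frame optimal alignment; the blank index and names are assumptions.

```python
import numpy as np

BLANK = 0  # assumed CTC blank index

def ctc_pooled_embeddings(frame_emb, log_probs):
    """Greedy CTC best-path sketch: assign each frame a character via
    argmax, collapse repeats and blanks, and mean-pool the frame-level
    acoustic embeddings (AE) over each character span.

    frame_emb: (T, D) frame-level AEs; log_probs: (T, V) CTC posteriors.
    Returns a list of character-level AEs."""
    path = log_probs.argmax(axis=1)
    char_embs, span, prev = [], [], BLANK
    for t, token in enumerate(path):
        if token != BLANK and token != prev:
            if span:                             # close the previous character span
                char_embs.append(np.mean(frame_emb[span], axis=0))
            span = [t]
        elif token != BLANK:                     # repeat of the same character
            span.append(t)
        prev = token
    if span:
        char_embs.append(np.mean(frame_emb[span], axis=0))
    return char_embs
```

Matching then reduces to, e.g., cosine similarity between each pooled character AE and the corresponding character TE of the enrolled keyword.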
arXiv Detail & Related papers (2024-06-12T06:44:40Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework that combines three tasks: video-to-audio, audio-to-text, and text-to-audio.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation in a single model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation [15.765495448426904]
We propose a novel approach to tackle the data imbalance problem in the audio-language retrieval task.
A distance sampling-based paraphraser leveraging ChatGPT generates a controllable distribution of manipulated text data.
The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
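A heavily simplified sketch of distance-controlled sampling, with `paraphrase_fn` standing in for an LLM call (e.g., ChatGPT) and `embed_fn` for a sentence encoder; both are hypothetical interfaces, not the paper's.

```python
import numpy as np

def sample_by_distance(caption, paraphrase_fn, embed_fn, lo, hi, n=8):
    """Keep only paraphrases whose cosine distance from the original
    caption falls in [lo, hi], giving a controllable distribution of
    manipulated text data."""
    ref = embed_fn(caption)
    ref = ref / np.linalg.norm(ref)
    kept = []
    for _ in range(n):
        cand = paraphrase_fn(caption)            # one LLM paraphrase draw
        v = embed_fn(cand)
        v = v / np.linalg.norm(v)
        dist = 1.0 - float(ref @ v)              # cosine distance
        if lo <= dist <= hi:
            kept.append((cand, dist))
    return kept
```

Varying the [lo, hi] bucket is what makes the distribution of generated text controllable.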
arXiv Detail & Related papers (2024-05-01T07:44:28Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
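A minimal sketch of the similarity test, assuming pre-computed audio and caption embeddings in the shared space and an illustrative threshold; the paper's detector is built on this kind of shared-space similarity but its exact form may differ.

```python
import numpy as np

def hallucination_score(audio_vec, caption_vec, threshold=0.5):
    """Flag a generated caption as potentially hallucinated when its
    text embedding is far from the audio embedding in a shared
    audio-text latent space. Returns (is_suspect, similarity)."""
    a = audio_vec / np.linalg.norm(audio_vec)
    t = caption_vec / np.linalg.norm(caption_vec)
    sim = float(a @ t)
    return sim < threshold, sim
```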
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding [5.697227044927832]
We propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder.
Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors.
Experimental results show that our scheme outperforms state-of-the-art results on the LibriPhrase hard dataset.
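A minimal sketch of such a text encoder, with `g2p` and `phoneme_table` as hypothetical stand-ins for a trained G2P model and a table of representative phoneme vectors.

```python
import numpy as np

def encode_text(text, g2p, phoneme_table):
    """Audio-compliant text encoder sketch: map text to phonemes with a
    G2P model, then look up one representative vector per phoneme to
    form a phoneme-level embedding sequence."""
    phonemes = g2p(text)                          # e.g., "cat" -> ["K", "AE", "T"]
    vecs = [phoneme_table[p] for p in phonemes]   # representative phoneme vectors
    return np.stack(vecs)                         # (num_phonemes, D) text embedding
```

Because the text embedding lives in the same phoneme-shaped space as the audio embedding, the two sequences can be compared directly for keyword detection.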
arXiv Detail & Related papers (2023-08-12T05:41:15Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
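One common way to make overlapped diarization a single-label problem is a power-set label space, sketched below; this is a generic encoding for illustration, and SEND's exact label design may differ.

```python
from itertools import combinations

def powerset_labels(num_speakers, max_overlap=2):
    """Enumerate speaker subsets up to `max_overlap`, so each frame gets
    exactly one class label even when speakers overlap."""
    labels = [()]                                  # empty subset = silence
    for k in range(1, max_overlap + 1):
        labels += list(combinations(range(num_speakers), k))
    return {subset: idx for idx, subset in enumerate(labels)}

# e.g., with 3 speakers, a frame where speakers 0 and 2 overlap
# maps to the single class for the subset (0, 2).
mapping = powerset_labels(3)
```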
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Audio Captioning using Gated Recurrent Units [1.3960152426268766]
The VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task.
The proposed architecture encodes audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.
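A minimal PyTorch sketch of the described layout (separate audio and text encoding, combined before the decoding stage), with illustrative sizes rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class BiGRUCaptioner(nn.Module):
    """Sketch: a bidirectional GRU encodes audio (e.g., VGGish)
    embeddings; token embeddings encode the text; the pooled audio
    context is concatenated with each token step before a GRU decoder."""
    def __init__(self, audio_dim=128, vocab=5000, hid=256):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hid, bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(3 * hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, audio, tokens):
        enc, _ = self.audio_enc(audio)              # (B, T, 2*hid)
        ctx = enc.mean(dim=1, keepdim=True)         # pooled audio context
        emb = self.embed(tokens)                    # (B, L, hid)
        ctx = ctx.expand(-1, emb.size(1), -1)       # repeat context per step
        dec, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec)                        # next-token logits
```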
arXiv Detail & Related papers (2020-06-05T12:03:12Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
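The triplet objective can be sketched as follows, assuming cosine distance and an illustrative margin; the encoders producing the embeddings are omitted.

```python
import torch
import torch.nn.functional as F

def triplet_relatedness_loss(anchor, positive, negative, margin=0.2):
    """Pull an audio embedding toward a semantically related cross-modal
    (e.g., text-derived) embedding and push it away from an unrelated
    one, by at least `margin` in cosine distance."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```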
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)