Relational Proxy Loss for Audio-Text based Keyword Spotting
- URL: http://arxiv.org/abs/2406.05314v1
- Date: Sat, 8 Jun 2024 01:21:17 GMT
- Title: Relational Proxy Loss for Audio-Text based Keyword Spotting
- Authors: Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho
- Abstract summary: This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings.
By incorporating RPL, we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
- Score: 8.932603220365793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there has been an increasing focus on user convenience, leading to increased interest in text-based keyword enrollment systems for keyword spotting (KWS). Since the system utilizes text input during the enrollment phase and audio input during actual usage, we call this task audio-text based KWS. To enable this task, both acoustic and text encoders are typically trained using deep metric learning loss functions, such as triplet- and proxy-based losses. This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings. Unlike previous studies that only compare acoustic and text embeddings on a point-to-point basis, our approach focuses on the relational structures within the embedding space by introducing the concept of Relational Proxy Loss (RPL). By incorporating RPL, we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
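The contrast the abstract draws between point-to-point comparison and relational structure can be illustrated with a small sketch. This is not the paper's RPL formulation, which the abstract does not spell out; it assumes cosine distance and a simple squared difference between the pairwise-distance structure of the two embedding sets, and all function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(embeddings):
    """Distances between every unordered pair within one embedding set."""
    n = len(embeddings)
    return [
        cosine_distance(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def point_to_point_loss(acoustic, text):
    """Mean distance between each acoustic embedding and its own text embedding."""
    return sum(cosine_distance(a, t) for a, t in zip(acoustic, text)) / len(acoustic)


def relational_loss(acoustic, text):
    """Mean squared gap between the two sets' internal distance structures.

    Assumption: an illustrative distance-matching term, not the paper's loss.
    """
    d_acoustic = pairwise_distances(acoustic)
    d_text = pairwise_distances(text)
    return sum((x - y) ** 2 for x, y in zip(d_acoustic, d_text)) / len(d_acoustic)


# Toy data: row i of each list stands for the same keyword.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The same layout rotated by 90 degrees: every point moves, the structure does not.
rotated_text = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

p2p = point_to_point_loss(acoustic, rotated_text)
rel = relational_loss(acoustic, rotated_text)
```

On this toy data the point-to-point term is large because every text embedding sits far from its acoustic counterpart, while the relational term is essentially zero because the distances within each set agree. The two terms therefore measure different things, which is why a training objective might combine them, for example as a weighted sum with a hypothetical weight.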
Related papers
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC).
We then aggregate the frame-level acoustic embedding (AE) to obtain higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text.
arXiv Detail & Related papers (2024-06-12T06:44:40Z)
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
- Boosting Punctuation Restoration with Data Generation and Reinforcement Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR).
The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts.
This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z)
- Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting [23.627625026135505]
We propose a novel end-to-end user-defined keyword spotting method.
Our method compares input queries with an enrolled text keyword sequence.
We introduce the LibriPhrase dataset for efficiently training keyword spotting models.
arXiv Detail & Related papers (2022-06-30T16:40:31Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification [21.826585075806573]
This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent speaker verification (SV).
The proposed system achieves outstanding results for text-dependent SV.
arXiv Detail & Related papers (2021-06-25T09:11:18Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)