End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients
- URL: http://arxiv.org/abs/2405.14489v1
- Date: Thu, 23 May 2024 12:24:01 GMT
- Title: End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients
- Authors: Kesavaraj V, Anuprabha M, Anil Kumar Vuppala
- Abstract summary: We propose to use shifted delta coefficients (SDC), which help capture pronunciation variability.
The proposed approach demonstrates superior performance compared to state-of-the-art UDKWS techniques.
- Score: 6.626696929949397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel-frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may struggle to accurately distinguish closely related pronunciations in audio-text pairs, owing to their limited capability to capture the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help capture pronunciation variability (transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline, with improvements of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging Libriphrase-hard dataset. Moreover, the proposed approach demonstrates superior performance compared to state-of-the-art UDKWS techniques.
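The paper itself does not include code, but shifted delta coefficients follow a standard N-d-P-k recipe: from N static cepstra per frame, k delta blocks are stacked, each computed over a +/-d window at offsets spaced P frames apart (7-1-3-7 is a common configuration in the literature). Below is a minimal NumPy sketch, assuming MFCCs have already been extracted; the function name and default parameters are illustrative, not taken from the paper.
```python
import numpy as np

def shifted_delta_coefficients(cepstra, d=1, P=3, k=7):
    """Shifted delta coefficients (SDC) from a (T, N) cepstral matrix.

    For each frame t, the k difference vectors
        c[t + i*P + d] - c[t + i*P - d],   i = 0 .. k-1
    are concatenated, giving a (T, N*k) feature matrix that spans a much
    longer temporal context than per-frame MFCCs.
    """
    T, N = cepstra.shape
    pad = d + (k - 1) * P                       # widest look-ahead / look-back needed
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        ahead = padded[pad + i * P + d : pad + i * P + d + T]
        behind = padded[pad + i * P - d : pad + i * P - d + T]
        blocks.append(ahead - behind)
    return np.concatenate(blocks, axis=1)
```
In practice the resulting (T, N*k) matrix is typically appended to the static cepstra before being fed to the encoder; the abstract notes that several SDC configurations were explored to find a suitable temporal context.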
Related papers
- Learning Multi-Target TDOA Features for Sound Event Localization and Detection [11.193111023459803]
We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization.
We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features (a brief GCC-PHAT sketch appears after this list).
arXiv Detail & Related papers (2024-08-30T10:09:12Z)
- Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN.
arXiv Detail & Related papers (2024-07-23T15:07:52Z)
- Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum [13.81570624162769]
We propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC.
First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships.
Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration.
arXiv Detail & Related papers (2024-04-27T10:47:07Z)
- Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features [11.212228410835435]
We study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for device-directed speech detection (DDSD).
We present different approaches to combining scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves performance by up to 8.5% in terms of false acceptance rate (FA).
Our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities at inference time (a generic modality-dropout sketch appears after this list).
arXiv Detail & Related papers (2023-10-23T18:09:31Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation of the temporal sequence in sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
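For orientation, the NGCC-PHAT feature in the sound event localization entry above is a learned extension of the classical GCC-PHAT cross-correlation. Below is a minimal NumPy sketch of plain GCC-PHAT time-delay estimation, not the learned NGCC-PHAT front end from that paper; the signature and defaults are illustrative assumptions.
```python
import numpy as np

def gcc_phat(x1, x2, fs=16000, max_tau=None):
    """Classical GCC-PHAT delay estimate between two microphone channels."""
    n = x1.size + x2.size                      # FFT length long enough for linear correlation
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12             # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))   # center the zero lag
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs             # delay in seconds
    return delay, cc
```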
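The modality-dropout idea in the device-directed speech detection entry above can be illustrated generically: during training, an input stream is occasionally zeroed out so the fused model still works when a modality is missing at inference. The PyTorch sketch below is an assumption-laden illustration rather than that paper's exact scheme; the function name, embedding shapes, and dropout probability are all made up.
```python
import torch

def modality_dropout(audio_emb, prosody_emb, p_drop=0.25, training=True):
    """Randomly zero one modality per example (batch-wise) during training."""
    if training:
        batch = audio_emb.size(0)
        drop_audio = torch.rand(batch, device=audio_emb.device) < p_drop
        drop_prosody = torch.rand(batch, device=prosody_emb.device) < p_drop
        drop_prosody &= ~drop_audio                    # never drop both streams at once
        audio_emb = audio_emb * (~drop_audio).float().unsqueeze(-1)
        prosody_emb = prosody_emb * (~drop_prosody).float().unsqueeze(-1)
    # Late fusion by concatenation; any fusion head could consume the result.
    return torch.cat([audio_emb, prosody_emb], dim=-1)

# Example: 8 utterances with 256-dim verbal and 64-dim prosody embeddings.
fused = modality_dropout(torch.randn(8, 256), torch.randn(8, 64))
```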