Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling
- URL: http://arxiv.org/abs/2001.09246v1
- Date: Sat, 25 Jan 2020 01:19:19 GMT
- Title: Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling
- Authors: Hyun-Jin Park, Patrick Violette, Niranjan Subrahmanya
- Abstract summary: We propose smoothed max pooling loss and its application to keyword spotting systems.
The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect the whole keyword) in a semi-supervised manner.
The proposed loss function allows training a model to detect the parts and the whole of a keyword without strictly depending on frame-level labeling from LVCSR.
- Score: 9.927306160740974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose smoothed max pooling loss and its application to keyword spotting
systems. The proposed approach jointly trains an encoder (to detect keyword
parts) and a decoder (to detect the whole keyword) in a semi-supervised manner. The
proposed loss function allows training a model to detect the parts and the whole of
a keyword without strictly depending on frame-level labeling from LVCSR (Large
Vocabulary Continuous Speech Recognition), making further optimization
possible. The proposed system outperforms the baseline keyword spotting model
in [1] due to increased optimizability. Further, it can be more easily adapted
for on-device learning applications due to its reduced dependency on LVCSR.
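The paper defines its own smoothed max pooling formulation; purely as an illustrative sketch, a temperature-scaled log-sum-exp can stand in for the smoothed max over per-frame keyword scores, with a cross-entropy applied to the pooled score so that only an utterance-level keyword/no-keyword label is needed (the function names and the `tau` parameter below are assumptions, not taken from the paper):

```python
import math

def smooth_max(logits, tau=1.0):
    """Smooth approximation of the max over frame scores via a
    temperature-scaled log-sum-exp; as tau -> 0 this approaches the
    hard max. (Illustrative stand-in only; the paper defines its own
    smoothing of the max pooling operation.)"""
    m = max(logits)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in logits))

def pooled_loss(frame_scores, label, tau=1.0):
    """Pool per-frame keyword logits over the utterance, then apply
    binary cross-entropy to the pooled score. Only an utterance-level
    label is required -- no frame-level LVCSR alignment."""
    s = smooth_max(frame_scores, tau)    # pooled utterance-level logit
    p = 1.0 / (1.0 + math.exp(-s))       # sigmoid -> P(keyword present)
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))
```

With a confident keyword frame and a positive label (or uniformly low frames and a negative label) the loss is near zero, which is the behavior that lets the encoder learn to fire on keyword parts without frame-accurate supervision.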
Related papers
- Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z)
- Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs [6.790758328248708]
FiNDR (Fine-grained Name Discovery via Reasoning) is a framework for vocabulary-free fine-grained recognition. It operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8%.
arXiv Detail & Related papers (2025-12-21T22:01:29Z)
- Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation [90.5844979560448]
VocAlign is a source-free domain adaptation framework specifically designed for VLMs in semantic segmentation. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks.
arXiv Detail & Related papers (2025-09-18T17:59:58Z)
- TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree [17.16475665648591]
This paper proposes a universal context-biasing framework for Automatic Speech Recognition (ASR). The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding. Results show the proposed method is highly efficient, surpassing the considered open-source context-biasing approaches.
arXiv Detail & Related papers (2025-08-09T15:27:07Z)
- Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models [3.25590215530292]
Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. We propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets.
arXiv Detail & Related papers (2025-06-21T11:39:11Z)
- Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection [0.0]
We propose a novel multitask learning approach that integrates a noise classification head into the ASR encoder.
Our method enhances the model's robustness to noisy environments, leading to a significant reduction in false alarms and improved overall call-for-help performance.
arXiv Detail & Related papers (2025-01-20T18:01:42Z)
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts for identifying arbitrary keypoints across any species.
We develop a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
- Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems [41.24728444810133]
This paper investigates few-shot learning methods for open-set KWS classification by combining a deep feature encoder with a prototype-based classifier.
With user-defined keywords from 10 classes of the Google Speech Command dataset, our study reports an accuracy of up to 76% in a 10-shot scenario.
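As an illustrative sketch only (function names and the distance choice below are assumptions, not taken from the paper), a prototype-based classifier of the kind combined with the deep feature encoder works by averaging each class's few support embeddings into a prototype and assigning a query to the nearest one:

```python
def prototypes(support_embeddings, support_labels):
    """Compute one prototype per class as the mean of that class's
    support embeddings (the standard prototypical-network recipe).
    Embeddings would come from the trained feature encoder."""
    protos = {}
    for c in sorted(set(support_labels)):
        vecs = [e for e, y in zip(support_embeddings, support_labels) if y == c]
        protos[c] = [sum(dim) / len(vecs) for dim in zip(*vecs)]
    return protos

def classify(query, protos):
    """Assign the query embedding to the class of the nearest
    prototype under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda c: sq_dist(query, protos[c]))
```

Because only the support-set averages change when a user enrolls new keywords, this step needs no gradient updates on device, which is what makes the scheme attractive for on-device customization.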
arXiv Detail & Related papers (2023-06-03T17:10:33Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to the small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect open-set inputs, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention as a way to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
- Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks [3.8382752162527933]
In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
arXiv Detail & Related papers (2020-02-25T13:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.