Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling
- URL: http://arxiv.org/abs/2001.09246v1
- Date: Sat, 25 Jan 2020 01:19:19 GMT
- Title: Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling
- Authors: Hyun-Jin Park, Patrick Violette, Niranjan Subrahmanya
- Abstract summary: We propose a smoothed max pooling loss and its application to keyword spotting systems.
The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect the whole keyword) in a semi-supervised manner.
The proposed loss function allows training a model to detect the parts and the whole of a keyword without strictly depending on frame-level labeling from LVCSR.
- Score: 9.927306160740974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a smoothed max pooling loss and its application to keyword
spotting systems. The proposed approach jointly trains an encoder (to detect
keyword parts) and a decoder (to detect the whole keyword) in a semi-supervised
manner. The proposed new loss function allows training a model to detect the
parts and the whole of a keyword without strictly depending on frame-level
labeling from LVCSR (large-vocabulary continuous speech recognition), making
further optimization possible. The proposed system outperforms the baseline
keyword spotting model of [1] due to its increased optimizability, and it can be
more easily adapted for on-device learning applications thanks to its reduced
dependency on LVCSR.
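The paper itself ships no code here; as a rough, illustrative sketch of the central idea, the PyTorch snippet below uses a temperature-scaled log-sum-exp as a smooth surrogate for max pooling over per-frame keyword logits, trained against a weak utterance-level label. The exact smoothing and loss composition in the paper may differ; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def smoothed_max_pool_loss(frame_logits: torch.Tensor,
                           labels: torch.Tensor,
                           tau: float = 1.0) -> torch.Tensor:
    """Illustrative smoothed max pooling loss for keyword spotting.

    frame_logits: (batch, time) per-frame keyword logits from the decoder.
    labels:       (batch,) 1.0 if the keyword occurs in the utterance, else 0.0.
    tau:          temperature; as tau -> 0 this approaches a hard max over time.
    """
    # Smooth max over the time axis: unlike a hard max, log-sum-exp spreads
    # gradient over nearby frames instead of a single argmax frame.
    smooth_max = tau * torch.logsumexp(frame_logits / tau, dim=1)
    # Utterance-level binary cross-entropy against the weak (clip-level) label,
    # so no frame-level (LVCSR-derived) alignment is strictly required.
    return F.binary_cross_entropy_with_logits(smooth_max, labels)

# Toy usage: a batch of 2 utterances, 50 frames each.
logits = torch.randn(2, 50, requires_grad=True)
loss = smoothed_max_pool_loss(logits, torch.tensor([1.0, 0.0]))
loss.backward()
```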
Related papers
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
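As a hedged illustration of the sliding-window slicing MROVSeg describes (not the authors' code; the window size and non-overlapping stride are assumptions, and dimensions are taken to be divisible by the window size for brevity):

```python
import torch

def slice_into_windows(image: torch.Tensor, win: int = 336) -> torch.Tensor:
    """Slice a high-resolution image (C, H, W) into uniform win x win
    patches, each matching the image encoder's expected input size."""
    c, h, w = image.shape
    assert h % win == 0 and w % win == 0, "simplifying assumption"
    patches = image.unfold(1, win, win).unfold(2, win, win)  # (C, nH, nW, win, win)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, win, win)

windows = slice_into_windows(torch.randn(3, 672, 1344), win=336)
print(windows.shape)  # torch.Size([8, 3, 336, 336]) -> batched through the encoder
```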
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts to identify arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
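As a rough sketch of keyword-conditioned normalization in the spirit of AdaKWS (illustrative only, not the authors' implementation; module and dimension names are hypothetical), a text encoder's keyword embedding can predict the scale and shift applied after instance normalization of the acoustic features:

```python
import torch
import torch.nn as nn

class KeywordAdaIN(nn.Module):
    """Instance-normalize acoustic features, then scale/shift them with
    parameters predicted from a keyword (text) embedding."""
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.to_gamma_beta = nn.Linear(text_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, time); text_emb: (batch, text_dim)
        gamma, beta = self.to_gamma_beta(text_emb).chunk(2, dim=-1)
        return self.norm(feats) * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)

layer = KeywordAdaIN(text_dim=128, channels=64)
out = layer(torch.randn(4, 64, 100), torch.randn(4, 128))  # (4, 64, 100)
```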
- Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems [41.24728444810133]
This paper investigates few-shot learning methods for open-set KWS classification by combining a deep feature encoder with a prototype-based classifier.
With user-defined keywords from 10 classes of the Google Speech Command dataset, our study reports an accuracy of up to 76% in a 10-shot scenario.
arXiv Detail & Related papers (2023-06-03T17:10:33Z)
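A minimal sketch of the prototype-based open-set classifier this entry describes (illustrative; the paper's encoder, distance metric, and rejection rule may differ): each class prototype is the mean embedding of its few enrollment shots, and a query farther than a threshold from every prototype is rejected as unknown.

```python
import torch

def classify_with_prototypes(query: torch.Tensor,
                             support: torch.Tensor,
                             threshold: float = 1.0) -> int:
    """query:   (dim,) embedding of the test utterance.
    support: (classes, shots, dim) embeddings of enrollment examples.
    Returns the predicted class index, or -1 (open set / unknown) when
    the nearest prototype is farther than `threshold`."""
    prototypes = support.mean(dim=1)                         # (classes, dim)
    dists = torch.cdist(query.unsqueeze(0), prototypes)[0]   # (classes,)
    best = int(torch.argmin(dists))
    return best if dists[best] <= threshold else -1

# 10 user-defined keyword classes, 10 shots each, 32-dim embeddings.
pred = classify_with_prototypes(torch.randn(32), torch.randn(10, 10, 32))
```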
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
The key problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts (see the sketch after this entry).
Finally, a self-training approach is used to leverage a larger corpus of image-text pairs.
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
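One common way to realize a trainable shortcut, sketched here under assumptions rather than as the paper's exact design, is a zero-initialized learnable gate around the new block: the module starts as an identity, so the pretrained vision-text alignment is initially undisturbed, and the gate learns how much of the new path to mix in.

```python
import torch
import torch.nn as nn

class GatedShortcut(nn.Module):
    """Wrap a new (randomly initialized) block with a learnable,
    zero-initialized gate; at init the module is an identity."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.gate * self.block(x)

layer = GatedShortcut(nn.Conv2d(256, 256, 3, padding=1))
out = layer(torch.randn(1, 256, 32, 32))  # equals the input at initialization
```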
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect open-set inputs, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
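A minimal sketch of the dummy-prototype idea (illustrative; D-ProtoNets' episode-known construction is more involved than this): append a learnable dummy prototype to the episode's class prototypes and treat a query assigned to that extra class as an open-set rejection.

```python
import torch
import torch.nn as nn

class DummyProtoHead(nn.Module):
    """Append a learnable dummy prototype; the extra class means 'unknown'."""
    def __init__(self, dim: int):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(1, dim))

    def forward(self, queries: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        # queries: (n, dim); prototypes: (classes, dim)
        protos = torch.cat([prototypes, self.dummy], dim=0)  # (classes + 1, dim)
        return -torch.cdist(queries, protos)  # nearer prototype -> higher logit

head = DummyProtoHead(dim=32)
logits = head(torch.randn(5, 32), torch.randn(10, 32))  # (5, 11); train with CE
```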
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
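As a hedged illustration of adversarial augmentation applied to intermediate feature embeddings rather than input pixels (a single FGSM-style step; the paper's actual procedure and its normalization component are likely different):

```python
import torch
import torch.nn.functional as F

def adversarially_augment_features(feats: torch.Tensor,
                                   head: torch.nn.Module,
                                   labels: torch.Tensor,
                                   eps: float = 0.1) -> torch.Tensor:
    """One FGSM step on intermediate embeddings: perturb the features in the
    direction that increases the classification loss, then reuse them as
    additional training examples. feats: (batch, dim)."""
    feats = feats.detach().requires_grad_(True)
    loss = F.cross_entropy(head(feats), labels)
    grad, = torch.autograd.grad(loss, feats)
    return (feats + eps * grad.sign()).detach()

head = torch.nn.Linear(64, 10)
aug = adversarially_augment_features(torch.randn(8, 64), head,
                                     torch.randint(0, 10, (8,)))
```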
- Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
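A toy sketch of discriminative nearest-neighbor intent detection (the pairwise scorer below is a stand-in for a fine-tuned NLI cross-encoder, and all names are illustrative): score the query against every labeled example and return the label of the best match.

```python
import torch

def nearest_neighbor_intent(query_emb, example_embs, example_labels, pair_scorer):
    """Score (query, example) pairs with a discriminative pairwise scorer and
    return the label of the highest-scoring example."""
    pairs = torch.cat([query_emb.expand(len(example_embs), -1), example_embs], dim=-1)
    scores = pair_scorer(pairs).squeeze(-1)  # (n_examples,)
    return example_labels[int(torch.argmax(scores))]

scorer = torch.nn.Linear(2 * 64, 1)  # stand-in for an NLI-style cross-encoder
label = nearest_neighbor_intent(torch.randn(1, 64), torch.randn(20, 64),
                                ["greet"] * 10 + ["goodbye"] * 10, scorer)
```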
- Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks [3.8382752162527933]
In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
arXiv Detail & Related papers (2020-02-25T13:27:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.