Query-by-Example Keyword Spotting system using Multi-head Attention and
Softtriple Loss
- URL: http://arxiv.org/abs/2102.07061v1
- Date: Sun, 14 Feb 2021 03:37:37 GMT
- Title: Query-by-Example Keyword Spotting system using Multi-head Attention and
Softtriple Loss
- Authors: Jinmiao Huang, Waseem Gharbieh, Han Suk Shim, Eugene Kim
- Abstract summary: This paper proposes a neural network architecture for tackling the query-by-example user-defined keyword spotting task.
A multi-head attention module is added on top of a multi-layered GRU for effective feature extraction.
We also adopt the softtriple loss - a combination of triplet loss and softmax loss - and showcase its effectiveness.
- Score: 1.179778723980276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a neural network architecture for tackling the
query-by-example user-defined keyword spotting task. A multi-head attention
module is added on top of a multi-layered GRU for effective feature extraction,
and a normalized multi-head attention module is proposed for feature
aggregation. We also adopt the softtriple loss - a combination of triplet loss
and softmax loss - and showcase its effectiveness. We demonstrate the
performance of our model on internal datasets with different languages and the
public Hey-Snips dataset. We compare the performance of our model to a baseline
system and conduct an ablation study to show the benefit of each component in
our architecture. The proposed work shows solid performance while preserving
simplicity.
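To make the described architecture and loss concrete, the sketch below shows a minimal PyTorch encoder that stacks multi-head self-attention on top of a multi-layered GRU and pools the frame sequence with an attention-based aggregator, together with a SoftTriple-style loss over multiple learned centers per keyword class. This is not the authors' implementation: the layer sizes, number of heads, number of centers, and hyperparameters (la, gamma, margin) are illustrative assumptions, the pooling module is a simplified stand-in for the paper's normalized multi-head attention, and the center-merging regularizer of the original SoftTriple formulation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QbEKeywordEncoder(nn.Module):
    """Hypothetical encoder: multi-layered GRU -> multi-head self-attention ->
    attention-based pooling into a fixed-size, L2-normalised keyword embedding."""

    def __init__(self, n_mels=40, hidden=128, gru_layers=3, heads=4, emb_dim=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=gru_layers, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Learned query that pools the frame sequence into a single vector
        # (a simplified stand-in for the paper's normalized multi-head attention).
        self.pool_query = nn.Parameter(torch.randn(1, 1, hidden))
        self.pool_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, feats):                        # feats: (B, T, n_mels)
        h, _ = self.gru(feats)                       # frame-level features (B, T, hidden)
        h, _ = self.self_attn(h, h, h)               # refine features with self-attention
        q = self.pool_query.expand(h.size(0), -1, -1)
        pooled, _ = self.pool_attn(q, h, h)          # aggregate over time (B, 1, hidden)
        emb = self.proj(pooled.squeeze(1))           # (B, emb_dim)
        return F.normalize(emb, dim=-1)              # unit-norm keyword embedding


class SoftTripleLoss(nn.Module):
    """SoftTriple-style loss (Qian et al., 2019): a softmax loss over multiple
    learned centers per class, which relaxes the triplet loss.
    The center-merging regularizer of the original formulation is omitted."""

    def __init__(self, emb_dim, n_classes, centers_per_class=2,
                 la=20.0, gamma=0.1, margin=0.01):
        super().__init__()
        self.la, self.gamma, self.margin = la, gamma, margin
        self.n_classes, self.K = n_classes, centers_per_class
        self.centers = nn.Parameter(torch.randn(emb_dim, n_classes * centers_per_class))

    def forward(self, emb, labels):                  # emb: (B, D), L2-normalised
        centers = F.normalize(self.centers, dim=0)   # unit-norm centers (D, C*K)
        sim = (emb @ centers).view(-1, self.n_classes, self.K)  # (B, C, K)
        prob = F.softmax(sim / self.gamma, dim=2)    # soft assignment over K centers
        class_sim = (prob * sim).sum(dim=2)          # relaxed class similarity (B, C)
        delta = torch.zeros_like(class_sim)
        delta[torch.arange(emb.size(0)), labels] = self.margin  # margin on true class
        return F.cross_entropy(self.la * (class_sim - delta), labels)


# Usage sketch with random tensors standing in for log-mel features and labels.
encoder = QbEKeywordEncoder()
criterion = SoftTripleLoss(emb_dim=128, n_classes=100)
feats = torch.randn(8, 98, 40)                       # batch of 8 utterances, 98 frames
labels = torch.randint(0, 100, (8,))
loss = criterion(encoder(feats), labels)
loss.backward()
```

At enrollment and query time only the encoder would be used: the cosine similarity between an enrolled keyword embedding and an incoming query embedding is compared against a detection threshold.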
Related papers
- Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging [11.70130626541926]
We propose a novel framework for learning cross-modality features to enhance matching and registration across multi-modality retinal images.
Our model draws on the success of previous learning-based feature detection and description methods.
It is trained in a self-supervised manner by enforcing segmentation consistency between different augmentations of the same image.
arXiv Detail & Related papers (2024-07-25T19:51:27Z)
- Small, Versatile and Mighty: A Range-View Perception Framework [13.85089181673372]
We propose a novel multi-task framework for 3D detection from LiDAR data.
Our framework integrates semantic segmentation and panoptic segmentation tasks for the LiDAR point cloud.
Among range-view-based methods, our model achieves new state-of-the-art detection performance on the Open dataset.
arXiv Detail & Related papers (2024-03-01T07:02:42Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive prediction.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
- Exploring Multi-dimensional Data via Subset Embedding [13.092303047029311]
We propose a visual analytics approach to exploring subset patterns.
The core of the approach is a subset embedding network (SEN) that represents a group of subsets as uniformly-formatted embeddings.
The design enables the network to handle arbitrary subsets and to capture the similarity of subsets on single features.
arXiv Detail & Related papers (2021-04-24T03:08:08Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA) module.
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)