Query-by-Example Keyword Spotting system using Multi-head Attention and
Softtriple Loss
- URL: http://arxiv.org/abs/2102.07061v1
- Date: Sun, 14 Feb 2021 03:37:37 GMT
- Title: Query-by-Example Keyword Spotting system using Multi-head Attention and
Softtriple Loss
- Authors: Jinmiao Huang, Waseem Gharbieh, Han Suk Shim, Eugene Kim
- Abstract summary: This paper proposes a neural network architecture for tackling the query-by-example user-defined keyword spotting task.
A multi-head attention module is added on top of a multi-layered GRU for effective feature extraction.
We also adopt the softtriple loss - a combination of triplet loss and softmax loss - and showcase its effectiveness.
- Score: 1.179778723980276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a neural network architecture for tackling the
query-by-example user-defined keyword spotting task. A multi-head attention
module is added on top of a multi-layered GRU for effective feature extraction,
and a normalized multi-head attention module is proposed for feature
aggregation. We also adopt the softtriple loss - a combination of triplet loss
and softmax loss - and showcase its effectiveness. We demonstrate the
performance of our model on internal datasets with different languages and the
public Hey-Snips dataset. We compare the performance of our model to a baseline
system and conduct an ablation study to show the benefit of each component in
our architecture. The proposed work shows solid performance while preserving
simplicity.
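To make the described pipeline concrete, below is a minimal PyTorch sketch (not the authors' released code) of the architecture the abstract outlines: a multi-layered GRU encoder, a multi-head self-attention block for feature extraction, and an attention-based pooling layer standing in for the proposed normalized multi-head attention aggregator. All layer sizes, input features, and the exact normalization scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QbyEEncoder(nn.Module):
    """Sketch: multi-layered GRU + multi-head attention encoder with attention pooling."""
    def __init__(self, n_mels=40, hidden=128, gru_layers=3, heads=4, emb_dim=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=gru_layers, batch_first=True)
        # multi-head self-attention over the GRU outputs (feature extraction)
        self.mha = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # learned queries for attention pooling (a stand-in for the normalized
        # multi-head attention aggregation module described in the abstract)
        self.pool_query = nn.Parameter(torch.randn(1, heads, hidden))
        self.proj = nn.Linear(heads * hidden, emb_dim)

    def forward(self, x):                                   # x: (batch, frames, n_mels)
        h, _ = self.gru(x)                                   # (batch, frames, hidden)
        h, _ = self.mha(h, h, h)                             # self-attention across frames
        q = self.pool_query.expand(x.size(0), -1, -1)        # (batch, heads, hidden)
        scores = torch.bmm(q, h.transpose(1, 2))             # (batch, heads, frames)
        weights = F.softmax(scores / h.size(-1) ** 0.5, dim=-1)
        pooled = torch.bmm(weights, h)                        # (batch, heads, hidden)
        emb = self.proj(pooled.flatten(1))                    # (batch, emb_dim)
        return F.normalize(emb, dim=-1)                       # unit-norm keyword embedding
```

The softtriple loss the abstract adopts can likewise be sketched as a softmax loss over multiple learned centers per keyword class, which behaves like a smoothed triplet loss (Qian et al., 2019). The hyper-parameters below (number of centers, scale, temperature, margin) are illustrative defaults, not the paper's settings.

```python
class SoftTripleLoss(nn.Module):
    """Sketch of the SoftTriple loss: softmax over multiple centers per class."""
    def __init__(self, emb_dim, n_classes, centers_per_class=4,
                 scale=20.0, gamma=0.1, margin=0.01):
        super().__init__()
        self.scale, self.gamma, self.margin = scale, gamma, margin
        self.n_classes, self.K = n_classes, centers_per_class
        self.centers = nn.Parameter(torch.randn(emb_dim, n_classes * centers_per_class))

    def forward(self, emb, labels):                # emb: (batch, emb_dim), unit-norm
        centers = F.normalize(self.centers, dim=0)
        sim = (emb @ centers).view(-1, self.n_classes, self.K)
        # soft assignment of each embedding to the centers of every class
        weights = F.softmax(sim / self.gamma, dim=2)
        class_sim = (weights * sim).sum(dim=2)     # (batch, n_classes)
        # apply the margin only to the ground-truth class logit
        delta = torch.zeros_like(class_sim)
        delta.scatter_(1, labels.unsqueeze(1), self.margin)
        return F.cross_entropy(self.scale * (class_sim - delta), labels)
```

In a query-by-example setting, training would back-propagate SoftTripleLoss(...)(QbyEEncoder()(features), keyword_ids), while at test time enrolled query embeddings would typically be compared against incoming audio embeddings by cosine similarity.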
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Iterative Feature Exclusion Ranking for Deep Tabular Learning [0.0]
Tabular data is a common format for storing information in rows and columns to represent data entries and their features.
Deep neural networks have become the main approach across a wide range of domains, including computer vision and NLP.
Recent models featuring an internal feature selection mechanism with end-to-end gradient-based optimization fail to account for the contextual dependence of feature importance.
The proposed module iteratively excludes each feature from the input data and computes the attention scores, which represent the impact of the features on the prediction.
arXiv Detail & Related papers (2024-12-21T02:12:14Z)
- Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging [11.70130626541926]
We propose a novel framework for learning cross-modality features to enhance matching and registration across multi-modality retinal images.
Our model draws on the success of previous learning-based feature detection and description methods.
It is trained in a self-supervised manner by enforcing segmentation consistency between different augmentations of the same image.
arXiv Detail & Related papers (2024-07-25T19:51:27Z)
- Small, Versatile and Mighty: A Range-View Perception Framework [13.85089181673372]
We propose a novel multi-task framework for 3D detection on LiDAR data.
Our framework integrates semantic segmentation and panoptic segmentation tasks for the LiDAR point cloud.
Among range-view-based methods, our model achieves new state-of-the-art detection performance on the Open dataset.
arXiv Detail & Related papers (2024-03-01T07:02:42Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.