QbyE-MLPMixer: Query-by-Example Open-Vocabulary Keyword Spotting using
MLPMixer
- URL: http://arxiv.org/abs/2206.13231v1
- Date: Thu, 23 Jun 2022 18:18:44 GMT
- Title: QbyE-MLPMixer: Query-by-Example Open-Vocabulary Keyword Spotting using
MLPMixer
- Authors: Jinmiao Huang, Waseem Gharbieh, Qianhui Wan, Han Suk Shim, Chul Lee
- Abstract summary: Current keyword spotting systems are typically trained with a large amount of pre-defined keywords.
We propose a pure MLP-based neural network built on the MLPMixer model architecture.
Our proposed model has a smaller number of parameters and MACs compared to the baseline models.
- Score: 10.503972720941693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current keyword spotting systems are typically trained with a large amount of
pre-defined keywords. Recognizing keywords in an open-vocabulary setting is
essential for personalizing smart device interaction. Towards this goal, we
propose a pure MLP-based neural network that is based on MLPMixer - an MLP
model architecture that effectively replaces the attention mechanism in Vision
Transformers. We investigate different ways of adapting the MLPMixer
architecture to the QbyE open-vocabulary keyword spotting task. Comparisons
with the state-of-the-art RNN and CNN models show that our method achieves
better performance in challenging situations (10dB and 6dB environments) on
both the publicly available Hey-Snips dataset and a larger scale internal
dataset with 400 speakers. Our proposed model also has a smaller number of
parameters and MACs compared to the baseline models.
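To make the query-by-example setting concrete, here is a minimal PyTorch sketch of one plausible pipeline: a Mixer-style encoder maps a fixed-length log-Mel segment to an embedding, a user enrolls a few spoken examples of a custom keyword, and a test utterance is accepted when its embedding is close enough to an enrolled one. The block layout, input sizes, cosine-similarity decision rule, and every hyperparameter below are illustrative assumptions, not the paper's reported architecture or matching scheme.

```python
# Illustrative sketch only: a Mixer-style encoder plus cosine matching for
# query-by-example keyword spotting. All sizes/thresholds are assumptions.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    """One MLP-Mixer block: token (time) mixing followed by channel mixing."""

    def __init__(self, num_tokens, dim, token_hidden, channel_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Mix across time frames (tokens): transpose so Linear acts on the token axis.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Mix across feature channels.
        x = x + self.channel_mlp(self.norm2(x))
        return x


class QbyEEncoder(nn.Module):
    """Maps a fixed-length log-Mel segment to an L2-normalized embedding."""

    def __init__(self, num_frames=100, num_mels=40, dim=64, depth=4, emb_dim=64):
        super().__init__()
        self.proj = nn.Linear(num_mels, dim)  # per-frame projection
        self.blocks = nn.Sequential(
            *[MixerBlock(num_frames, dim, 128, 128) for _ in range(depth)]
        )
        self.head = nn.Linear(dim, emb_dim)

    def forward(self, mel):  # mel: (batch, num_frames, num_mels)
        h = self.blocks(self.proj(mel)).mean(dim=1)  # average over time
        return nn.functional.normalize(self.head(h), dim=-1)


def detect(encoder, enrolled_mels, test_mel, threshold=0.8):
    """Return True if the test utterance matches any enrolled keyword example."""
    with torch.no_grad():
        enrolled = encoder(enrolled_mels)         # (n_enroll, emb_dim)
        query = encoder(test_mel.unsqueeze(0))    # (1, emb_dim)
        score = (enrolled @ query.T).max().item()  # max cosine similarity
    return score >= threshold
```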
Related papers
- Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking [6.9366619419210656]
Transformers have established themselves as the leading neural network model in natural language processing.
Recent research has explored replacing attention modules with other mechanisms, including those described by MetaFormers.
This paper integrates Krotov's hierarchical associative memory with MetaFormers, enabling a comprehensive representation of the Transformer block.
arXiv Detail & Related papers (2024-06-18T02:42:19Z)
- TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting [13.410217680999459]
Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions.
High memory and computing requirements pose a critical bottleneck for long-term forecasting.
We propose TSMixer, a lightweight neural architecture composed of multi-layer perceptron (MLP) modules.
arXiv Detail & Related papers (2023-06-14T06:26:23Z)
- iMixer: hierarchical Hopfield network implies an invertible, implicit and iterative MLP-Mixer [2.5782420501870296]
We generalize the correspondence between Hopfield networks and Transformer-like architectures to obtain iMixer.
iMixer is a generalization of MLP-Mixer whose MLP layers propagate forward from the output side to the input side.
We evaluate the model performance with various datasets on image classification tasks.
The results imply that the correspondence between the Hopfield networks and the Mixer models serves as a principle for understanding a broader class of Transformer-like architecture designs.
arXiv Detail & Related papers (2023-04-25T18:00:08Z)
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
- ActiveMLP: An MLP-like Architecture with Active Token Mixer [54.95923719553343]
This paper presents ActiveMLP, a general MLP-like backbone for computer vision.
We propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one.
In this way, the spatial range of token-mixing is expanded and the way of token-mixing is reformed.
arXiv Detail & Related papers (2022-03-11T17:29:54Z)
- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- AS-MLP: An Axial Shifted MLP Architecture for Vision [50.11765148947432]
An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper.
By axially shifting channels of the feature map, AS-MLP obtains information flow from different directions (an illustrative sketch of the axial shift appears after this list).
With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset.
arXiv Detail & Related papers (2021-07-18T08:56:34Z)
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure termed Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
arXiv Detail & Related papers (2021-06-28T17:59:57Z)
- MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
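Several entries above replace self-attention with hand-designed token-mixing operations. As one concrete illustration, the following is a rough PyTorch sketch of the axial shift described in the AS-MLP entry: channel groups are shifted by different offsets along a spatial axis so that a per-location channel MLP (here a 1x1 convolution) mixes features from neighboring positions. The group count, shift size, and the 1x1-convolution mixer are assumptions made for illustration, not the paper's exact design.

```python
# Rough sketch of the axial-shift idea: channel groups of a feature map are
# shifted by different offsets along one spatial axis, so a subsequent
# per-location channel MLP mixes information from neighboring positions.
# Shift size and grouping are illustrative assumptions.
import torch


def axial_shift(x: torch.Tensor, shift: int = 1, dim: int = 2) -> torch.Tensor:
    """Shift equal channel groups of x (batch, channels, height, width) by
    offsets -shift .. +shift along the chosen spatial axis, zero-padding the
    positions that are shifted in from outside the feature map."""
    offsets = list(range(-shift, shift + 1))      # e.g. [-1, 0, +1]
    groups = torch.chunk(x, len(offsets), dim=1)  # split along channels
    shifted = []
    for g, off in zip(groups, offsets):
        s = torch.roll(g, shifts=off, dims=dim)
        # Zero out wrapped-around positions so the shift behaves like padding.
        if off > 0:
            s.narrow(dim, 0, off).zero_()
        elif off < 0:
            s.narrow(dim, s.size(dim) + off, -off).zero_()
        shifted.append(s)
    return torch.cat(shifted, dim=1)


# Usage: shift along height and width, then a 1x1 convolution (a per-location
# channel MLP) mixes the spatially shifted channels.
x = torch.randn(2, 12, 8, 8)  # (batch, C, H, W)
mixed = torch.nn.Conv2d(12, 12, kernel_size=1)(
    axial_shift(axial_shift(x, dim=2), dim=3)
)
print(mixed.shape)  # torch.Size([2, 12, 8, 8])
```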
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.