Wake Word Detection with Streaming Transformers
- URL: http://arxiv.org/abs/2102.04488v1
- Date: Mon, 8 Feb 2021 19:14:32 GMT
- Title: Wake Word Detection with Streaming Transformers
- Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur
- Abstract summary: Experiments on the Mobvoi wake word dataset show that the proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate.
- Score: 72.66551640048405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern wake word detection systems usually rely on neural networks for
acoustic modeling. Transformers have recently shown superior performance over
LSTM and convolutional networks in various sequence modeling tasks with their
better temporal modeling power. However, it is not clear whether this advantage
still holds for short-range temporal modeling like wake word detection.
Besides, the vanilla Transformer is not directly applicable to the task due to
its non-streaming nature and the quadratic time and space complexity. In this
paper we explore the performance of several variants of chunk-wise streaming
Transformers tailored for wake word detection in a recently proposed LF-MMI
system, including looking-ahead to the next chunk, gradient stopping, different
positional embedding methods and adding same-layer dependency between chunks.
Our experiments on the Mobvoi wake word dataset demonstrate that our proposed
Transformer model outperforms the baseline convolution network by 25% on
average in false rejection rate at the same false alarm rate with a comparable
model size, while still maintaining linear complexity w.r.t. the sequence
length.
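The chunk-wise attention pattern described in the abstract can be illustrated with a small attention mask. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `chunk_attention_mask` and the parameters `chunk_size`, `left_chunks`, and `right_chunks` are hypothetical, and the sketch assumes each frame attends to its own chunk, a fixed number of previous chunks, and optionally the next chunk (the look-ahead variant).

```python
import numpy as np


def chunk_attention_mask(num_frames, chunk_size, left_chunks=1, right_chunks=1):
    """Return a boolean mask M where M[i, j] is True if query frame i
    may attend to key frame j under chunk-wise attention.

    All parameter names are illustrative assumptions, not taken from the paper.
    """
    # Chunk index of every frame, e.g. chunk_size=2 -> 0, 0, 1, 1, 2, 2, ...
    chunk_index = np.arange(num_frames) // chunk_size
    # offset[i, j] = (chunk of key j) - (chunk of query i)
    offset = chunk_index[None, :] - chunk_index[:, None]
    # Allow a bounded number of chunks to the left and to the right.
    return (offset >= -left_chunks) & (offset <= right_chunks)


if __name__ == "__main__":
    mask = chunk_attention_mask(num_frames=8, chunk_size=2)
    print(mask.astype(int))
    # Each query sees at most (left_chunks + 1 + right_chunks) * chunk_size keys,
    # regardless of the sequence length.
    print(mask.sum(axis=1).max())
```

Because every row of the mask has a bounded number of allowed keys, the total attention cost grows linearly with the number of frames rather than quadratically. Setting `right_chunks=0` removes the look-ahead, so no frame depends on a future chunk.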
Related papers
- TIMBA: Time series Imputation with Bi-directional Mamba Blocks and Diffusion models [0.0]
We propose replacing time-oriented Transformers with State-Space Models (SSM).
We develop a model that integrates SSM, Graph Neural Networks, and node-oriented Transformers to achieve enhanced representations.
arXiv Detail & Related papers (2024-10-08T11:10:06Z)
- sTransformer: A Modular Approach for Extracting Inter-Sequential and Temporal Information for Time-Series Forecasting [6.434378359932152]
We review and categorize existing Transformer-based models into two main types: (1) modifications to the model structure and (2) modifications to the input data.
We propose sTransformer, which introduces the Sequence and Temporal Convolutional Network (STCN) to fully capture both sequential and temporal information.
We compare our model with linear models and existing forecasting models on long-term time-series forecasting, achieving new state-of-the-art results.
arXiv Detail & Related papers (2024-08-19T06:23:41Z)
- TSLANet: Rethinking Transformers for Time Series Representation Learning [19.795353886621715]
Time series data is characterized by its intrinsic long and short-range dependencies.
We introduce a novel Time Series Lightweight Network (TSLANet) as a universal convolutional model for diverse time series tasks.
Our experiments demonstrate that TSLANet outperforms state-of-the-art models in various tasks spanning classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2024-04-12T13:41:29Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of encoded results along position index by such a dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.