Wake Word Detection with Streaming Transformers
- URL: http://arxiv.org/abs/2102.04488v1
- Date: Mon, 8 Feb 2021 19:14:32 GMT
- Title: Wake Word Detection with Streaming Transformers
- Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur
- Abstract summary: We show that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate.
- Score: 72.66551640048405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern wake word detection systems usually rely on neural networks for
acoustic modeling. Transformers have recently shown superior performance over
LSTMs and convolutional networks in various sequence modeling tasks thanks to
their stronger temporal modeling power. However, it is not clear whether this advantage
still holds for short-range temporal modeling like wake word detection.
Besides, the vanilla Transformer is not directly applicable to the task due to
its non-streaming nature and the quadratic time and space complexity. In this
paper we explore the performance of several variants of chunk-wise streaming
Transformers tailored for wake word detection in a recently proposed LF-MMI
system, including looking-ahead to the next chunk, gradient stopping, different
positional embedding methods and adding same-layer dependency between chunks.
Our experiments on the Mobvoi wake word dataset demonstrate that our proposed
Transformer model outperforms the baseline convolution network by 25% on
average in false rejection rate at the same false alarm rate with a comparable
model size, while still maintaining linear complexity w.r.t. the sequence
length.
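The abstract's core idea, attention restricted to a local chunk plus its neighbors so that cost grows linearly with sequence length, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a single head, identity query/key/value projections, no positional embedding, and a hypothetical window of one previous chunk (left context) plus an optional look-ahead to the next chunk.

```python
import numpy as np

def chunk_streaming_attention(x, chunk_size=4, look_ahead=True):
    """Chunk-wise self-attention sketch: each chunk's queries attend only to
    the current chunk, the previous chunk, and (optionally) the next chunk,
    so total cost is O(T * chunk_size) rather than O(T^2)."""
    T, d = x.shape
    out = np.zeros_like(x)
    n_chunks = (T + chunk_size - 1) // chunk_size
    for c in range(n_chunks):
        q_lo, q_hi = c * chunk_size, min((c + 1) * chunk_size, T)
        k_lo = max(0, q_lo - chunk_size)                      # previous chunk as left context
        k_hi = min(T, q_hi + chunk_size) if look_ahead else q_hi  # look-ahead to next chunk
        q = x[q_lo:q_hi]                                      # queries: current chunk
        kv = x[k_lo:k_hi]                                     # keys/values: local window
        scores = q @ kv.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                     # softmax over the local window
        out[q_lo:q_hi] = w @ kv
    return out
```

Because each output frame depends only on a bounded window, the model can run in streaming fashion: with look-ahead enabled, emitting a chunk requires buffering just one future chunk of input.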
Related papers
- TSLANet: Rethinking Transformers for Time Series Representation Learning [19.795353886621715]
Time series data is characterized by its intrinsic long and short-range dependencies.
We introduce a novel Time Series Lightweight Network (TSLANet) as a universal convolutional model for diverse time series tasks.
Our experiments demonstrate that TSLANet outperforms state-of-the-art models in various tasks spanning classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2024-04-12T13:41:29Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z) - Transformer-based conditional generative adversarial network for multivariate time series generation [0.0]
Conditional generation of time-dependent data is a task that has attracted much interest.
Recent works proposed a Transformer-based Time series generative adversarial network (TTS-GAN)
We extend the TTS-GAN by conditioning its generated output on a particular encoded context.
We show that this transformer-based CGAN can generate realistic high-dimensional and long data sequences under different kinds of conditions.
arXiv Detail & Related papers (2022-10-05T08:29:33Z) - DT-SV: A Transformer-based Time-domain Approach for Speaker Verification [24.613926376221155]
Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech.
We propose an approach to derive utterance-level speaker embeddings via a Transformer architecture.
We also introduce a learnable mel-fbank energy feature extractor named time-domain feature extractor.
arXiv Detail & Related papers (2022-05-26T09:36:26Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers by directly translating the image feature map into the object detection result.
Recent transformer-based image recognition models such as ViT show consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD)
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z) - Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of encoded results along position index by such a dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.