FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
- URL: http://arxiv.org/abs/2008.02516v4
- Date: Mon, 15 Mar 2021 07:23:19 GMT
- Title: FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
- Authors: Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas
Jing Yuan
- Abstract summary: We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model.
- Score: 74.04394069262108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lipreading is an impressive technique whose accuracy has improved
markedly in recent years. However, existing lipreading methods mainly build on
autoregressive (AR) models, which generate target tokens one by one and suffer
from high inference latency. To break through this constraint, we propose
FastLR, a non-autoregressive (NAR) lipreading model which generates all target
tokens simultaneously. NAR lipreading is a challenging task with several
difficulties: 1) the discrepancy between source and target sequence lengths
makes it difficult to estimate the length of the output sequence; 2) the
conditionally independent behavior of NAR generation lacks correlation across
time, which leads to a poor approximation of the target distribution; 3) the
feature representation ability of the encoder can be weak due to the lack of an
effective alignment mechanism; and 4) the removal of the AR language model
exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper,
we introduce three methods to reduce the gap between FastLR and AR models:
1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F)
module to model the correspondence between source video frames and the output
text sequence; 2) to tackle challenge 3, we add an auxiliary connectionist
temporal classification (CTC) decoder on top of the encoder and optimize it
with an extra CTC loss, and we also add an auxiliary autoregressive decoder to
help the feature extraction of the encoder; 3) to overcome challenge 4, we
propose a novel Noisy Parallel Decoding (NPD) method for I&F and bring
Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR
achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art
lipreading model, with slight absolute WER increases of 1.5% and 5.5% on the
GRID and LRS2 lipreading datasets respectively, which demonstrates the
effectiveness of our proposed method.
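
To make the integrate-and-fire idea concrete, the following is a minimal sketch of a continuous integrate-and-fire aggregator in PyTorch. It is an illustration under stated assumptions, not the paper's released implementation: the class name IntegrateAndFire, the sigmoid weight predictor, and the firing threshold of 1.0 are illustrative choices, and padding/length handling is omitted.

# A minimal sketch of an integrate-and-fire (I&F) aggregator, assuming PyTorch.
# Names (IntegrateAndFire, fire_threshold) are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class IntegrateAndFire(nn.Module):
    """Aggregates frame-level encoder states into token-level states.

    A scalar weight in (0, 1) is predicted per video frame; weights are
    accumulated over time, and each time the running sum crosses the firing
    threshold a token boundary is emitted. The total weight therefore also
    acts as a soft estimate of the output length (challenges 1 and 2 above).
    """

    def __init__(self, hidden_dim: int, fire_threshold: float = 1.0):
        super().__init__()
        self.weight_proj = nn.Linear(hidden_dim, 1)
        self.fire_threshold = fire_threshold

    def forward(self, encoder_out: torch.Tensor):
        # encoder_out: (batch, num_frames, hidden_dim)
        alphas = torch.sigmoid(self.weight_proj(encoder_out)).squeeze(-1)  # (B, T)
        batch_tokens = []
        for b in range(encoder_out.size(0)):
            acc, state, tokens = 0.0, torch.zeros_like(encoder_out[b, 0]), []
            for t in range(encoder_out.size(1)):
                a = alphas[b, t]
                if acc + a < self.fire_threshold:
                    # Integrate: keep accumulating weighted frame features.
                    acc = acc + a
                    state = state + a * encoder_out[b, t]
                else:
                    # Fire: split the weight at the boundary and emit a token state.
                    remainder = self.fire_threshold - acc
                    tokens.append(state + remainder * encoder_out[b, t])
                    acc = a - remainder
                    state = acc * encoder_out[b, t]
            batch_tokens.append(torch.stack(tokens) if tokens else state.unsqueeze(0))
        # Token counts differ per sample; padding and length losses are omitted here.
        return batch_tokens, alphas

In a full model one would additionally, as the abstract describes, supervise the per-utterance sum of the firing weights against the true token count and attach an auxiliary CTC decoder (e.g., a linear projection over encoder_out trained with torch.nn.CTCLoss); those training details are assumptions beyond this sketch.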
Related papers
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding.
LANTERN increases speed-ups by $\mathbf{1.75}\times$ and $\mathbf{1.76}\times$, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z) - Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask [74.64216073678617]
AMD performs parallel NAR inference within contiguous blocks of output labels concealed using attention masks.
A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities.
Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x.
arXiv Detail & Related papers (2024-06-14T13:42:38Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z) - Improving Dual-Encoder Training through Dynamic Indexes for Negative
Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and that dynamically maintains the tree.
In our study on datasets with over twenty million targets, our approach cuts error in half relative to oracle brute-force negative mining.
arXiv Detail & Related papers (2023-03-27T15:18:32Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - Highly Parallel Autoregressive Entity Linking with Discriminative
Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is >70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z)