Lipreading using Temporal Convolutional Networks
- URL: http://arxiv.org/abs/2001.08702v1
- Date: Thu, 23 Jan 2020 17:49:35 GMT
- Title: Lipreading using Temporal Convolutional Networks
- Authors: Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic
- Abstract summary: The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers.
We address the limitations of this model and propose changes which further improve its performance.
Our proposed model yields absolute improvements of 1.2% and 3.2% on the LRW and LRW1000 datasets, respectively.
- Score: 57.41253104365274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip-reading has attracted a lot of research attention lately thanks to
advances in deep learning. The current state-of-the-art model for recognition
of isolated words in-the-wild consists of a residual network and Bidirectional
Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of
this model and we propose changes which further improve its performance.
Firstly, the BGRU layers are replaced with Temporal Convolutional Networks
(TCN). Secondly, we greatly simplify the training procedure, which allows us to
train the model in a single stage. Thirdly, we show that the current
state-of-the-art methodology produces models that do not generalize well to
variations in sequence length, and we address this issue by proposing a
variable-length augmentation. We present results on the largest publicly
available datasets for isolated word recognition in English and Mandarin, LRW
and LRW1000, respectively. Our proposed model yields an absolute improvement of
1.2% and 3.2%, respectively, on these datasets, setting a new state of the art.
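To make the TCN replacement concrete, here is a minimal PyTorch sketch of a dilated temporal convolution block of the kind TCNs stack over per-frame features; the channel width, kernel size, dilation schedule, and all names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated temporal convolution block with a residual connection.

    A minimal sketch of a TCN building block; channel width, kernel size,
    and dropout are illustrative, not the paper's exact settings.
    """
    def __init__(self, channels, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # symmetric padding keeps the time length
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):                   # x: (batch, channels, time)
        return torch.relu(self.net(x) + x)  # residual connection

# Stack blocks with growing dilation to widen the temporal receptive field.
tcn = nn.Sequential(*[TemporalBlock(512, dilation=2 ** i) for i in range(4)])
frame_features = torch.randn(8, 512, 29)  # e.g. 512-d ResNet features for 29 frames
out = tcn(frame_features)                 # time length preserved: (8, 512, 29)
```

Unlike a BGRU, every output here depends only on convolutions, so all time steps are computed in parallel.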
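The variable-length augmentation can likewise be pictured as randomly cropping each training clip in time so the model sees many durations; the sketch below is one plausible reading of the idea, with the minimum length and crop policy as assumptions.

```python
import random
import torch

def variable_length_augment(frames, min_len=5):
    """Randomly crop a (time, H, W) clip to a random temporal length.

    One plausible reading of variable-length augmentation: training on
    many durations so the model generalizes across sequence lengths.
    The minimum length and crop policy are assumptions.
    """
    t = frames.shape[0]
    new_len = random.randint(min(min_len, t), t)  # random target length
    start = random.randint(0, t - new_len)        # random temporal offset
    return frames[start:start + new_len]

clip = torch.randn(29, 88, 88)           # 29 grayscale mouth-region frames
cropped = variable_length_augment(clip)  # somewhere between 5 and 29 frames
```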
Related papers
- Bidirectional Awareness Induction in Autoregressive Seq2Seq Models [47.82947878753809]
Bidirectional Awareness Induction (BAI) is a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints.
In particular, we observed an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines.
arXiv Detail & Related papers (2024-08-25T23:46:35Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
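To make the setting of the entry above concrete, here is a toy sketch of that regime: a depth-2 ReLU network trained with full-batch gradient descent under a quadratic loss on Gaussian data; the width, step size, and the use of random labels in place of adversarial ones are assumptions.

```python
import torch

torch.manual_seed(0)
n, d, width = 200, 10, 512  # samples, input dimension, hidden width
X = torch.randn(n, d)       # Gaussian data instances
y = torch.randn(n)          # random labels as a stand-in for adversarial ones

W1 = torch.randn(d, width, requires_grad=True)
w2 = torch.randn(width, requires_grad=True)

lr = 1e-3
for step in range(2000):
    pred = torch.relu(X @ W1) @ w2 / width ** 0.5  # depth-2 ReLU network
    loss = ((pred - y) ** 2).mean()                # quadratic loss
    loss.backward()
    with torch.no_grad():                          # full-batch gradient descent
        W1 -= lr * W1.grad
        w2 -= lr * w2.grad
        W1.grad.zero_()
        w2.grad.zero_()
```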
- FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z)
- DTW-Merge: A Novel Data Augmentation Technique for Time Series Classification [6.091096843566857]
This paper proposes a novel data augmentation method for time series based on Dynamic Time Warping.
Combining the proposed approach with the recently introduced ResNet improves results on the 2018 UCR Time Series Classification Archive.
arXiv Detail & Related papers (2021-03-01T16:40:47Z)
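As a rough illustration of the entry above, the sketch below computes a DTW warping path between two same-class series and splices them at a random point on the path; this splicing rule is one plausible reading of DTW-Merge, not a verified reimplementation.

```python
import random
import numpy as np

def dtw_path(a, b):
    """Classic O(len(a) * len(b)) DTW; returns the optimal warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m  # backtrack from the end of both series
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dtw_merge(a, b):
    """Splice two same-class series at a random point on their DTW path.

    One plausible reading of DTW-Merge: the synthetic series follows `a`
    up to the split and `b` afterwards, so the two halves stay aligned.
    """
    path = dtw_path(a, b)
    split = random.randrange(1, len(path))
    head = [a[i] for i, _ in path[:split]]
    tail = [b[j] for _, j in path[split:]]
    return np.array(head + tail)

x1, x2 = np.sin(np.linspace(0, 6, 60)), np.sin(np.linspace(0.5, 6.5, 70))
augmented = dtw_merge(x1, x2)  # a new synthetic training series
```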
- Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [98.52189797347354]
We introduce multi-scale processing into the spatial feature extraction for lip-reading.
We merge information in all time steps of the sequence by utilizing self-attention.
Our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art.
arXiv Detail & Related papers (2020-12-28T16:55:51Z)
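A rough sketch of the two ingredients in the entry above: parallel convolutions at several kernel sizes for multi-scale spatial features, followed by self-attention over time to merge information across all steps; every layer size and name here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class PyramidalConv(nn.Module):
    """Parallel convolutions at several kernel sizes (multi-scale features).

    A sketch of the pyramidal-convolution idea; the kernel sizes and
    channel split are illustrative assumptions.
    """
    def __init__(self, in_ch=1, out_ch=64, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2)  # odd k keeps H, W
            for k in kernel_sizes)

    def forward(self, x):  # x: (batch, in_ch, H, W)
        return torch.cat([b(x) for b in self.branches], dim=1)

# Merge information across all time steps with self-attention.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
frames = torch.randn(8 * 29, 1, 88, 88)           # 8 clips of 29 mouth frames
feats = PyramidalConv()(frames).mean(dim=(2, 3))  # pooled to (8*29, 64)
seq = feats.view(8, 29, 64)                       # back to (batch, time, dim)
merged, _ = attn(seq, seq, seq)                   # every step attends to all steps
```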
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.