Enhancing CTC-Based Visual Speech Recognition
- URL: http://arxiv.org/abs/2409.07210v1
- Date: Wed, 11 Sep 2024 12:02:42 GMT
- Title: Enhancing CTC-Based Visual Speech Recognition
- Authors: Hendrik Laux, Anke Schmeink,
- Abstract summary: LiteVSR2 is an enhanced version of our previously introduced efficient approach to Visual Speech Recognition.
We introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process.
LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy.
- Score: 11.269066294359144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z) - Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv Detail & Related papers (2024-04-19T05:01:12Z) - LiteVSR: Efficient Visual Speech Recognition by Learning from Speech
Representations of Unlabeled Data [9.049193356646635]
Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks.
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
arXiv Detail & Related papers (2023-12-15T12:04:24Z) - RA-DIT: Retrieval-Augmented Dual Instruction Tuning [90.98423540361946]
Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores.
Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance.
We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option.
arXiv Detail & Related papers (2023-10-02T17:16:26Z) - Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z) - SiRi: A Simple Selective Retraining Mechanism for Transformer-based
Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly superior even with limited training data.
We also extend it to transformer-based visual grounding models and other vision-language tasks to verify the validity.
arXiv Detail & Related papers (2022-07-27T07:01:01Z) - Heterogeneous Reservoir Computing Models for Persian Speech Recognition [0.0]
Reservoir computing models (RC) models have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies.
We propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales.
arXiv Detail & Related papers (2022-05-25T09:15:15Z) - Neural Model Reprogramming with Similarity Based Mapping for
Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR)
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z) - Distributed Training of Deep Neural Network Acoustic Models for
Automatic Speech Recognition [33.032361181388886]
We provide an overview of distributed training techniques for deep neural network acoustic models for ASR.
Experiments are carried out on a popular public benchmark to study the convergence, speedup and recognition performance of the investigated strategies.
arXiv Detail & Related papers (2020-02-24T19:31:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.