LiteVSR: Efficient Visual Speech Recognition by Learning from Speech
Representations of Unlabeled Data
- URL: http://arxiv.org/abs/2312.09727v1
- Date: Fri, 15 Dec 2023 12:04:24 GMT
- Title: LiteVSR: Efficient Visual Speech Recognition by Learning from Speech
Representations of Unlabeled Data
- Authors: Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine,
Anke Schmeink
- Abstract summary: Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks.
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
- Score: 9.049193356646635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel, resource-efficient approach to Visual Speech
Recognition (VSR) leveraging speech representations produced by any trained
Automatic Speech Recognition (ASR) model. Moving away from the
resource-intensive trends prevalent in recent literature, our method distills
knowledge from a trained Conformer-based ASR model, achieving competitive
performance on standard VSR benchmarks with significantly less resource
utilization. Using unlabeled audio-visual data only, our baseline model
achieves a word error rate (WER) of 47.4% and 54.7% on the LRS2 and LRS3 test
benchmarks, respectively. After fine-tuning the model with limited labeled
data, the word error rate reduces to 35% (LRS2) and 45.7% (LRS3). Our model can
be trained on a single consumer-grade GPU within a few days and is capable of
performing real-time end-to-end VSR on dated hardware, suggesting a path
towards more accessible and resource-efficient VSR methodologies.
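The abstract names the technique but not its mechanics, so here is a minimal sketch of what feature-level distillation from a frozen ASR encoder could look like. Everything in it is an illustrative assumption rather than the authors' implementation: the `VisualFrontend` architecture, the 256-dimensional feature size, the L1 objective, and the premise that video and audio feature sequences are already length-aligned.

```python
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Hypothetical lightweight video encoder standing in for the student."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, squash space
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):                    # video: (B, 1, T, H, W)
        x = torch.relu(self.conv(video))         # (B, 64, T, H', W')
        x = self.pool(x).flatten(2)              # (B, 64, T)
        return self.proj(x.transpose(1, 2))      # (B, T, feat_dim)

def distillation_step(video, audio, student, asr_encoder, optimizer):
    """One unsupervised step: regress the frozen ASR encoder's audio
    representations from the corresponding (unlabeled) video frames."""
    with torch.no_grad():                        # teacher stays frozen
        target = asr_encoder(audio)              # assumed shape (B, T, feat_dim)
    pred = student(video)                        # (B, T, feat_dim)
    loss = nn.functional.l1_loss(pred, target)   # feature-matching objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student is trained to imitate the ASR encoder's representations, its visual features can plausibly be decoded by the remaining layers of the same ASR model, which would explain how a baseline trained only on unlabeled audio-visual pairs can produce transcripts at all.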
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for auditory, visual, and audiovisual speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples (see the sketch after this list).
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Enhancing CTC-Based Visual Speech Recognition [11.269066294359144]
LiteVSR2 is an enhanced version of our previously introduced efficient approach to Visual Speech Recognition.
We introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process.
LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy.
arXiv Detail & Related papers (2024-09-11T12:02:42Z)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision [60.54020550732634]
We study the potential of leveraging synthetic visual data for visual speech recognition (VSR).
The key idea is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech.
We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3).
arXiv Detail & Related papers (2023-03-30T07:43:27Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (a sketch of this idea also appears after this list).
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with the multi-task LM shows a 4.6% word error rate reduction (WERR) on rare-word recognition compared with an RNN-Transducer-only ASR baseline.
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
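As a note on the greedy pseudo-labelling mentioned in the Unified Speech Recognition entry above, the core loop can be pictured as below. This is a minimal sketch under stated assumptions: `model` returns per-frame log-probabilities from a CTC head, and `BLANK_ID` is a placeholder for the blank token index; neither comes from the paper.

```python
import torch

BLANK_ID = 0  # assumed CTC blank index

@torch.no_grad()
def greedy_pseudo_label(model, inputs):
    """Turn the model's own greedy CTC decode of unlabelled inputs
    into training targets."""
    log_probs = model(inputs)                  # (B, T, vocab)
    ids = log_probs.argmax(dim=-1)             # greedy path, (B, T)
    labels = []
    for seq in ids:
        seq = torch.unique_consecutive(seq)    # collapse repeated tokens
        labels.append(seq[seq != BLANK_ID])    # drop blanks
    return labels                              # usable as CTC targets
```

The appeal of the greedy path over beam search here is cost: pseudo-labelling runs over large unlabelled corpora, so a single argmax pass per clip keeps the labelling step cheap.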
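Similarly, the adversarial reprogramming entry above reduces to training only an input-space perturbation against a frozen model, plus a mapping from source classes to target classes. The sketch below is an assumption-laden illustration of that general technique, not the paper's recipe: the universal additive perturbation, the `(num_target_classes, k)` label mapping, and the mean aggregation are all guesses.

```python
import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    """Repurpose a frozen pretrained classifier for a new label set."""
    def __init__(self, frozen_model: nn.Module, num_samples: int,
                 mapping: torch.Tensor):
        super().__init__()
        self.model = frozen_model.eval()
        for p in self.model.parameters():
            p.requires_grad_(False)            # only the perturbation trains
        self.delta = nn.Parameter(torch.zeros(num_samples))
        self.mapping = mapping                 # (num_target_classes, k) source ids

    def forward(self, waveform):               # waveform: (B, num_samples)
        logits = self.model(waveform + self.delta)   # source-domain logits
        return logits[:, self.mapping].mean(dim=-1)  # (B, num_target_classes)
```

Training then optimizes `delta` alone with an ordinary cross-entropy loss on the remapped logits, which is what makes the approach attractive in low-resource settings.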
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.