LiteVSR: Efficient Visual Speech Recognition by Learning from Speech
Representations of Unlabeled Data
- URL: http://arxiv.org/abs/2312.09727v1
- Date: Fri, 15 Dec 2023 12:04:24 GMT
- Title: LiteVSR: Efficient Visual Speech Recognition by Learning from Speech
Representations of Unlabeled Data
- Authors: Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine,
Anke Schmeink
- Abstract summary: Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks.
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
- Score: 9.049193356646635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel, resource-efficient approach to Visual Speech
Recognition (VSR) leveraging speech representations produced by any trained
Automatic Speech Recognition (ASR) model. Moving away from the
resource-intensive trends prevalent in recent literature, our method distills
knowledge from a trained Conformer-based ASR model, achieving competitive
performance on standard VSR benchmarks with significantly less resource
utilization. Using unlabeled audio-visual data only, our baseline model
achieves a word error rate (WER) of 47.4% and 54.7% on the LRS2 and LRS3 test
benchmarks, respectively. After fine-tuning the model with limited labeled
data, the word error rate reduces to 35% (LRS2) and 45.7% (LRS3). Our model can
be trained on a single consumer-grade GPU within a few days and is capable of
performing real-time end-to-end VSR on dated hardware, suggesting a path
towards more accessible and resource-efficient VSR methodologies.
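The core idea described above, distilling speech representations from a trained ASR model into a visual model using unlabeled audio-visual data, can be sketched minimally. The function, feature shapes, and loss choice below are illustrative assumptions for exposition, not the paper's actual implementation: the student's visual features are regressed onto the frozen teacher's acoustic features.

```python
import numpy as np

def distillation_l1_loss(student_feats, teacher_feats):
    """Mean absolute error between the student's (video-derived) and the
    frozen teacher's (audio-derived) feature sequences of shape (T, D)."""
    assert student_feats.shape == teacher_feats.shape
    return float(np.mean(np.abs(student_feats - teacher_feats)))

# Toy example: 50 frames of 256-dim features (shapes are illustrative).
rng = np.random.default_rng(0)
teacher = rng.standard_normal((50, 256))          # frozen ASR encoder output
student = teacher + 0.1 * rng.standard_normal((50, 256))  # imperfect student
loss = distillation_l1_loss(student, teacher)
print(round(loss, 3))  # small loss: the student tracks the teacher closely
```

Because the regression target comes from the ASR encoder rather than from transcripts, such a loss needs no labels, which matches the paper's use of unlabeled audio-visual data only.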
Related papers
- BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]

We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a 26% WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision [60.54020550732634]
We study the potential of leveraging synthetic visual data for visual speech recognition (VSR).
The key idea is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech.
We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3).
arXiv Detail & Related papers (2023-03-30T07:43:27Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric w.r.t. the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with multi-task LM shows a 4.6% WER reduction (WERR) compared with the RNN Transducer-only ASR baseline on rare-word recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
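Most of the papers above report word error rate (WER), the standard ASR/VSR metric: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal reference implementation (not taken from any of the papers above):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitute, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
```

Note that WER can exceed 100% when the hypothesis contains many insertions relative to the reference.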
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.