Lip-Listening: Mixing Senses to Understand Lips using Cross Modality
Knowledge Distillation for Word-Based Models
- URL: http://arxiv.org/abs/2207.05692v1
- Date: Sun, 5 Jun 2022 15:47:54 GMT
- Title: Lip-Listening: Mixing Senses to Understand Lips using Cross Modality
Knowledge Distillation for Word-Based Models
- Authors: Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr, and Hesham M. Eraqi
- Abstract summary: This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training.
- Score: 0.03499870393443267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we propose a technique to transfer speech recognition
capabilities from audio speech recognition systems to visual speech
recognizers, where our goal is to utilize audio data during lipreading model
training. Impressive progress in the domain of speech recognition has been
exhibited by audio and audio-visual systems. Nevertheless, there is still much
to be explored with regard to visual speech recognition systems due to the
visual ambiguity of some phonemes. To this end, the development of visual
speech recognition models is crucial given the instability of audio models. The
main contributions of this work are: i) building on recent state-of-the-art
word-based lipreading models by integrating sequence-level and frame-level
Knowledge Distillation (KD) into their systems; ii) leveraging audio data during
the training of visual models, which has not been done in prior word-based
work; iii) proposing Gaussian-shaped averaging in frame-level KD as an
efficient technique that helps the model distill knowledge at the sequence
model encoder. This work proposes a novel and competitive architecture for
lip-reading, as we demonstrate a noticeable improvement in performance, setting
a new benchmark of 88.64% on the LRW dataset.
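As a rough illustration of the frame-level KD with Gaussian-shaped averaging described in the abstract, the sketch below pools the audio teacher's frame features with a Gaussian window centred on each visual student frame before applying a distillation loss. The function names, tensor shapes, the window width sigma, the linear time alignment between audio and video frames, and the use of an MSE objective are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
# Minimal sketch of frame-level KD with Gaussian-shaped averaging (hypothetical
# names/shapes; not the authors' implementation). The audio teacher typically
# produces more frames than the visual student, so teacher frames are pooled
# with a Gaussian window centred on each student frame.
import torch
import torch.nn.functional as F

def gaussian_weights(num_teacher: int, num_student: int, sigma: float = 2.0) -> torch.Tensor:
    """Return a (num_student, num_teacher) matrix of row-normalized Gaussian weights.

    Row i weights the teacher frames around the position linearly aligned with
    student frame i.
    """
    t = torch.arange(num_teacher, dtype=torch.float32)          # teacher frame indices
    centers = torch.linspace(0, num_teacher - 1, num_student)   # aligned centres per student frame
    w = torch.exp(-0.5 * ((t[None, :] - centers[:, None]) / sigma) ** 2)
    return w / w.sum(dim=1, keepdim=True)                       # each row sums to 1

def frame_level_kd_loss(student_feats: torch.Tensor,
                        teacher_feats: torch.Tensor,
                        sigma: float = 2.0) -> torch.Tensor:
    """MSE between student frames and Gaussian-averaged teacher frames.

    student_feats: (B, T_v, D) visual encoder outputs
    teacher_feats: (B, T_a, D) audio encoder outputs (teacher, gradients detached)
    """
    _, T_v, _ = student_feats.shape
    T_a = teacher_feats.shape[1]
    w = gaussian_weights(T_a, T_v, sigma).to(teacher_feats.device)   # (T_v, T_a)
    # Gaussian-shaped average of teacher frames for every student frame.
    pooled_teacher = torch.einsum('vt,btd->bvd', w, teacher_feats.detach())
    return F.mse_loss(student_feats, pooled_teacher)

# Hypothetical usage: combine with the word-classification loss, e.g.
#   loss = ce_loss + lambda_kd * frame_level_kd_loss(visual_enc_out, audio_enc_out)
```

Sequence-level KD, the other component mentioned in the abstract, is commonly realized as a divergence between the teacher's and student's output distributions rather than between encoder frames; the exact formulation used by the authors is not reproduced here.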
Related papers
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set a new state-of-the-art in multilingual VSR by achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition [27.58390468474957]
We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL).
AV-CPL is a semi-supervised method to train an audio-visual speech recognition model on a combination of labeled and unlabeled videos.
Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels.
arXiv Detail & Related papers (2023-09-29T16:57:21Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading [0.06157382820537718]
We propose a new lip-reading model that combines three contributions.
On the LRW lip-reading benchmark, a noticeable accuracy improvement is demonstrated.
arXiv Detail & Related papers (2021-08-07T23:46:25Z)