AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
Compressing Audio Knowledge of a Pretrained Model
- URL: http://arxiv.org/abs/2308.07593v2
- Date: Fri, 12 Jan 2024 03:29:20 GMT
- Title: AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
Compressing Audio Knowledge of a Pretrained Model
- Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro
- Abstract summary: We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality with the audio modality.
We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS3 dataset.
- Score: 53.492751392755636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from
silent lip movements. VSR is regarded as a challenging task because of the
insufficient information in lip movements. In this paper, we propose an Audio
Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements
the insufficient speech information of the visual modality with the audio
modality. Different from previous methods, the proposed AKVSR 1) utilizes rich
audio knowledge encoded by a large-scale pretrained audio model, 2) saves the
linguistic information of the audio knowledge in a compact audio memory by
discarding non-linguistic information through quantization, and 3) includes an
Audio Bridging Module that finds the best-matched audio features in the compact
audio memory, which makes training possible without audio inputs once the
compact audio memory has been built. We validate the effectiveness of the
proposed method through extensive experiments and achieve new state-of-the-art
performance on the widely used LRS3 dataset.
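
The abstract names three components: rich audio knowledge from a large-scale pretrained audio model, a quantized compact audio memory that keeps only linguistic content, and an Audio Bridging Module that retrieves the best-matched audio features from visual features. The sketch below is a minimal illustration of how such a quantized memory and an attention-based bridging step could fit together, assuming a PyTorch implementation; the class names, codebook size, feature dimensions, and nearest-codeword quantizer are assumptions for illustration, not the authors' actual design.

```python
# Minimal sketch (not the authors' implementation): a quantized "compact audio
# memory" and an attention-based bridging module that retrieves audio knowledge
# from visual features, so no audio input is needed once the memory is built.
import torch
import torch.nn as nn


class CompactAudioMemory(nn.Module):
    """Holds K learnable codewords; nearest-codeword quantization (assumed here)
    keeps the linguistic content of pretrained audio features and discards the rest."""

    def __init__(self, num_codes: int = 256, dim: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def quantize(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, D) features from a large-scale pretrained audio model.
        dists = (audio_feats.unsqueeze(2) - self.codebook).pow(2).sum(-1)  # (B, T, K)
        idx = dists.argmin(dim=-1)                                         # (B, T)
        return self.codebook[idx]                                          # (B, T, D)


class AudioBridgingModule(nn.Module):
    """Cross-attention with visual features as queries and the compact audio
    memory as keys/values: the best-matched audio features are mixed back in."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, memory: CompactAudioMemory) -> torch.Tensor:
        # Expand the codebook to a per-batch key/value set: (B, K, D).
        codes = memory.codebook.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        matched, _ = self.attn(visual_feats, codes, codes)   # (B, T, D)
        return visual_feats + matched                        # audio-enriched visual features


if __name__ == "__main__":
    memory = CompactAudioMemory()
    bridge = AudioBridgingModule()
    audio = torch.randn(2, 75, 512)       # offline stage: pretrained audio features
    compact = memory.quantize(audio)      # (2, 75, 512) codeword features
    video = torch.randn(2, 75, 512)       # training/inference stage: video only
    enriched = bridge(video, memory)
    print(compact.shape, enriched.shape)
```

In this sketch the memory is built once from pretrained audio features, after which the bridging step needs only video at training and inference time; the paper's actual quantization and retrieval details may differ.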
Related papers
- Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition [29.414663568089292]
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three temporal dynamics in video data.
We achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks under noise-dominant settings.
arXiv Detail & Related papers (2024-07-04T01:25:20Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition [27.58390468474957]
We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method for training an audio-visual speech recognition model on a combination of labeled and unlabeled videos.
Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels.
arXiv Detail & Related papers (2023-09-29T16:57:21Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Multi-Temporal Lip-Audio Memory for Visual Speech Recognition [32.65865343643458]
We present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information of lip movements.
The proposed method is mainly composed of two parts: 1) MTLAM saves multi-temporal audio features produced from short- and long-term audio signals, and 2) MTLAM memorizes a visual-to-audio mapping to load the stored multi-temporal audio features from visual features at the inference phase.
arXiv Detail & Related papers (2023-05-08T08:30:52Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)