Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
- URL: http://arxiv.org/abs/2305.04542v1
- Date: Mon, 8 May 2023 08:30:52 GMT
- Title: Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
- Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro
- Abstract summary: We present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information of lip movements.
The proposed method is mainly composed of two parts: 1) the MTLAM saves multi-temporal audio features produced from short- and long-term audio signals and memorizes a visual-to-audio mapping to load the stored multi-temporal audio features from visual features at the inference phase; 2) an audio temporal model produces multi-temporal audio features that capture the context of neighboring words.
- Score: 32.65865343643458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have recently been presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information of lip movements. The proposed method is mainly composed of two parts: 1) the MTLAM saves multi-temporal audio features produced from short- and long-term audio signals, and memorizes a visual-to-audio mapping to load the stored multi-temporal audio features from visual features at the inference phase. 2) We design an audio temporal model to produce multi-temporal audio features capturing the context of neighboring words. In addition, to construct an effective visual-to-audio mapping, the audio temporal models can generate audio features time-aligned with visual features. Through extensive experiments, we validate the effectiveness of the MTLAM, achieving state-of-the-art performance on two public VSR datasets.
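As a rough illustration of the visual-to-audio memory described in the abstract, the sketch below keeps learnable audio "value" slots that are addressed by visual query features, so that audio-like features can be recalled from the visual stream alone at inference. The slot count, feature dimension, and class/variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a visual-to-audio key-value memory (illustrative only;
# slot count, dimensions, and names are assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipAudioMemory(nn.Module):
    """Stores learnable audio 'value' slots addressed by visual features."""
    def __init__(self, num_slots: int = 112, dim: int = 512):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)    # visual-addressable keys
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # stored audio features

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, time, dim) frame-level visual features
        scores = torch.matmul(visual_feats, self.keys.t()) / self.keys.size(-1) ** 0.5
        addr = F.softmax(scores, dim=-1)                  # (batch, time, num_slots)
        recalled_audio = torch.matmul(addr, self.values)  # (batch, time, dim)
        return recalled_audio

# During training, recalled_audio would be pulled toward time-aligned audio
# features (e.g., an L2 or contrastive objective); at inference only the
# visual stream is needed to read the memory.
memory = LipAudioMemory()
video_feats = torch.randn(2, 75, 512)   # e.g., 3 s of lip frames at 25 fps
audio_like = memory(video_feats)        # (2, 75, 512)
```

In the paper's setting, one could maintain two such memories, one for short-term and one for long-term audio features, and fuse their outputs with the visual features before the decoder; the sketch omits the training losses that align recalled and ground-truth audio features.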
Related papers
- Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition [29.414663568089292]
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three temporal dynamics in video data.
We achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks in noise-dominant settings (a minimal sketch of such cross-modal attention is given after this list).
arXiv Detail & Related papers (2024-07-04T01:25:20Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of the visual modality by using the audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, such that the mouth region corresponds to the input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
- Dual-path Attention is All You Need for Audio-Visual Speech Extraction [34.7260610874298]
We propose a new way to fuse audio-visual features.
The proposed algorithm incorporates the visual features as an additional feature stream.
Results show that we achieve superior performance compared with other time-domain audio-visual fusion models.
arXiv Detail & Related papers (2022-07-09T07:27:46Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for audio speech recognition (ASR).
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly.
In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z)
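For the cross-modal attention entry above (Learning Video Temporal Dynamics with Cross-Modal Attention), the following minimal sketch shows one common way video features can attend to audio features; the shapes, head count, and module names are illustrative assumptions, not the cited paper's architecture.

```python
# Illustrative cross-modal attention sketch (shapes and names are assumptions,
# not the cited paper's architecture).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets one stream (e.g., video) attend to the other (e.g., audio) and
    fuses the result back with a residual connection."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, T_q, dim), e.g., per-frame video features
        # context_feats: (batch, T_k, dim), e.g., per-frame audio features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)

# Example: strengthen 75 video frames with 300 audio frames of context.
video = torch.randn(2, 75, 256)
audio = torch.randn(2, 300, 256)
fused_video = CrossModalAttention()(video, audio)   # (2, 75, 256)
```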
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.