How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
- URL: http://arxiv.org/abs/2004.08250v1
- Date: Fri, 17 Apr 2020 13:59:19 GMT
- Title: How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
- Authors: George Sterpu, Christian Saam, Naomi Harte
- Abstract summary: This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
- Score: 10.74796391075403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit,
the dynamic relationship between a human voice and the corresponding mouth
movements. A recently proposed multimodal fusion strategy, AV Align, based on
state-of-the-art sequence to sequence neural networks, attempts to model this
relationship by explicitly aligning the acoustic and visual representations of
speech. This study investigates the inner workings of AV Align and visualises
the audio-visual alignment patterns. Our experiments are performed on two of
the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that
AV Align learns to align acoustic and visual representations of speech at the
frame level on TCD-TIMIT in a generally monotonic pattern. We also identify why
AV Align initially shows no improvement over audio-only speech recognition on
the more challenging LRS2. We propose a regularisation method that involves
predicting lip-related Action Units from visual representations. Our
regularisation method leads to better exploitation of the visual modality, with
performance improvements between 7% and 30% depending on the noise level.
Furthermore, we show that the alternative Watch, Listen, Attend, and Spell
network is affected by the same problem as AV Align, and that our proposed
approach can effectively help it learn visual representations. Our findings
validate the suitability of the regularisation method for AVSR and encourage
researchers to rethink the multimodal convergence problem when one modality
dominates.
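To make the fusion and regularisation ideas above concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): acoustic frames attend over the visual sequence, in the spirit of AV Align, and an auxiliary head predicts lip-related Action Unit intensities from the visual representations as a regulariser. The layer sizes, number of Action Units, loss type, and loss weight are assumptions for illustration only.

```python
# Hypothetical sketch only; layer sizes, AU count and loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVAlignWithAURegulariser(nn.Module):
    def __init__(self, audio_dim=256, video_dim=256, num_aus=8, au_weight=0.1):
        super().__init__()
        # Cross-modal attention: acoustic frames query the visual sequence,
        # the kind of explicit audio-to-video alignment AV Align is built around.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=4,
            kdim=video_dim, vdim=video_dim, batch_first=True)
        self.fuse = nn.Linear(audio_dim * 2, audio_dim)
        # Auxiliary head: predict lip-related Action Unit intensities from the
        # visual representations, used purely as a training regulariser.
        self.au_head = nn.Linear(video_dim, num_aus)
        self.au_weight = au_weight

    def forward(self, audio_feats, video_feats, au_targets=None):
        # audio_feats: (B, T_audio, audio_dim); video_feats: (B, T_video, video_dim)
        attended, alignment = self.cross_attn(audio_feats, video_feats, video_feats)
        fused = self.fuse(torch.cat([audio_feats, attended], dim=-1))
        au_loss = torch.tensor(0.0, device=video_feats.device)
        if au_targets is not None:
            # au_targets: (B, T_video, num_aus), AU intensities scaled to [0, 1].
            au_pred = torch.sigmoid(self.au_head(video_feats))
            au_loss = self.au_weight * F.mse_loss(au_pred, au_targets)
        # `alignment` has shape (B, T_audio, T_video) and can be plotted to
        # visualise which video frames each audio frame attends to.
        return fused, alignment, au_loss
```

In a full AVSR system, the fused features would feed the recognition decoder, the auxiliary loss would be added to the recognition loss, and the returned attention weights could be visualised to inspect the audio-visual alignment patterns described in the abstract.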
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26 WER.
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR).
Specifically, we design a global interaction model to capture the A-V complementary relationship at the modality level, as well as a local alignment approach to model the A-V temporal consistency at the frame level.
Our GILA outperforms the supervised learning state of the art on the public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)