The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024
- URL: http://arxiv.org/abs/2408.02369v3
- Date: Thu, 12 Sep 2024 15:46:58 GMT
- Title: The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024
- Authors: He Wang, Lei Xie,
- Abstract summary: This paper delineates the visual speech recognition system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024)
In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation.
Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing second place in the open track of the Single-Speaker Task
- Score: 15.904649354308141
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing Enhanced ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing second place in the open track of the Single-Speaker Task and first place in the other three tracks.
Related papers
- The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in
CNVSRC 2023 [67.11294606070278]
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation.
arXiv Detail & Related papers (2024-01-07T14:20:52Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation
and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z) - Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for audio speech recognition (ASR)
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly.
At the second stage, the audio modality combines visual modality again to better understand the speech by a MSR sub-network, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z) - Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.