Streaming Audio-Visual Speech Recognition with Alignment Regularization
- URL: http://arxiv.org/abs/2211.02133v2
- Date: Sun, 2 Jul 2023 00:33:36 GMT
- Title: Streaming Audio-Visual Speech Recognition with Alignment Regularization
- Authors: Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja
Pantic
- Abstract summary: We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 (LRS3) dataset in offline and online setups, respectively.
- Score: 69.30185151873707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a streaming AV-ASR system based on a hybrid
connectionist temporal classification (CTC)/attention neural network
architecture. The audio and the visual encoder neural networks are both based
on the conformer architecture, which is made streamable using chunk-wise
self-attention (CSA) and causal convolution. Streaming recognition with a
decoder neural network is realized by using the triggered attention technique,
which performs time-synchronous decoding with joint CTC/attention scoring.
Additionally, we propose a novel alignment regularization technique that
promotes synchronization of the audio and visual encoder, which in turn results
in better word error rates (WERs) at all SNR levels for streaming and offline
AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the
Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup,
respectively, which both present state-of-the-art results when no external
training data are used.
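As a concrete illustration of the streaming mechanisms named in the abstract, below is a minimal PyTorch sketch (not the authors' code; the chunk size, padding scheme, and names are illustrative assumptions) of a chunk-wise self-attention mask and a causal 1-D convolution of the kind used to make a conformer encoder streamable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def chunkwise_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask where frame i may attend to frame j only if j lies in the
    same chunk or an earlier one, so look-ahead is bounded by the chunk size."""
    chunk_id = torch.arange(num_frames) // chunk_size
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)  # (T, T), True = allowed

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so the conformer convolution
    module sees no future frames and adds no extra algorithmic latency."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Example: 8 frames, chunks of 4 -> frames 0-3 cannot attend to frames 4-7
print(chunkwise_attention_mask(8, 4).int())
```

Such a boolean mask can be supplied to the self-attention layers of both the audio and the visual encoder (e.g., as a large negative bias on disallowed attention logits), which is what bounds the encoder latency to one chunk.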
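On the decoder side, triggered attention performs time-synchronous decoding with joint CTC/attention scoring. The sketch below shows only the standard hybrid CTC/attention score interpolation (the weight of 0.3 is an illustrative default, not a value taken from the paper), not the full triggered-attention decoding procedure.

```python
import math

def joint_ctc_attention_score(log_p_ctc: float, log_p_att: float,
                              ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC and attention-decoder log-probabilities of a
    partial hypothesis, as in hybrid CTC/attention decoding."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

# Example: pick the better of two candidate hypotheses (probabilities are made up)
candidates = {
    "set an alarm": (math.log(0.020), math.log(0.050)),   # (P_CTC, P_attention)
    "set and alarm": (math.log(0.015), math.log(0.030)),
}
best = max(candidates, key=lambda h: joint_ctc_attention_score(*candidates[h]))
print(best)  # -> "set an alarm"
```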
Related papers
- Low-Latency Neural Stereo Streaming [6.49558286032794]
Low-Latency neural codec for Stereo video Streaming (LLSS) is a novel parallel stereo video coding method designed for low-latency stereo video streaming.
LLSS processes left and right views in parallel, minimizing latency while substantially improving rate-distortion (R-D) performance compared to both existing neural and conventional codecs.
arXiv Detail & Related papers (2024-03-26T17:11:51Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be recognized better in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm facilitates a simple greedy decoding procedure and, at the same time, can produce decoding results at each frame with limited right context.
Experiments on voice search datasets and LibriSpeech show that, with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- Attention Driven Fusion for Multi-Modal Emotion Recognition [39.295892047505816]
We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio, followed by a DCNN.
For text processing, we use two branches (a DCNN and a bidirectional RNN followed by a DCNN) in parallel, with cross-attention introduced to infer N-gram-level correlations.
arXiv Detail & Related papers (2020-09-23T08:07:58Z)
- End-to-End Lip Synchronisation Based on Pattern Classification [15.851638021923875]
We propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream.
We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
arXiv Detail & Related papers (2020-05-18T11:42:32Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech, respectively.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.