Improved Lite Audio-Visual Speech Enhancement
- URL: http://arxiv.org/abs/2008.13222v3
- Date: Mon, 31 Jan 2022 19:57:05 GMT
- Title: Improved Lite Audio-Visual Speech Enhancement
- Authors: Shang-Yi Chuang, Hsin-Min Wang and Yu Tsao
- Abstract summary: We propose a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario.
In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems.
We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset.
- Score: 27.53117725152492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous studies have investigated the effectiveness of audio-visual
multimodal learning for speech enhancement (AVSE) tasks, seeking a solution
that uses visual data as auxiliary and complementary input to reduce the noise
of noisy speech signals. Recently, we proposed a lite audio-visual speech
enhancement (LAVSE) algorithm for a car-driving scenario. Compared to
conventional AVSE systems, LAVSE requires less online computation and, to some
extent, mitigates the privacy concerns associated with facial data. In this study, we extend
LAVSE to improve its ability to address three practical issues often
encountered in implementing AVSE systems, namely, the additional cost of
processing visual data, audio-visual asynchronization, and low-quality visual
data. The proposed system is termed improved LAVSE (iLAVSE), which uses a
convolutional recurrent neural network architecture as the core AVSE model. We
evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental
results confirm that compared to conventional AVSE systems, iLAVSE can
effectively overcome the aforementioned three practical issues and can improve
enhancement performance. The results also confirm that iLAVSE is suitable for
real-world scenarios, where high-quality audio-visual sensors may not always be
available.
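As a rough illustration of the kind of CRNN-based AVSE core described above, the sketch below fuses noisy spectrogram frames with a compressed per-frame visual embedding and predicts an enhanced spectrogram. The class name, layer sizes, and input conventions are assumptions for illustration only; this is not the authors' released implementation.

```python
# Minimal CRNN-style audio-visual SE sketch (illustration only, not the iLAVSE code).
import torch
import torch.nn as nn

class CRNNAVSE(nn.Module):
    def __init__(self, n_freq=257, visual_dim=64, hidden=256):
        super().__init__()
        # Convolutional front end over the noisy audio spectrogram (B, 1, T, F)
        self.audio_conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Recurrent layer over the frame-level fused audio-visual sequence
        self.rnn = nn.LSTM(32 * n_freq + visual_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)  # enhanced spectrogram per frame

    def forward(self, noisy_spec, visual_emb):
        # noisy_spec: (B, T, n_freq); visual_emb: (B, T, visual_dim), e.g. a compressed visual feature
        a = self.audio_conv(noisy_spec.unsqueeze(1))   # (B, 32, T, F)
        a = a.permute(0, 2, 1, 3).flatten(2)           # (B, T, 32*F)
        fused = torch.cat([a, visual_emb], dim=-1)     # frame-level audio-visual fusion
        h, _ = self.rnn(fused)
        return self.out(h)                             # (B, T, n_freq)

# Usage sketch:
# model = CRNNAVSE()
# est = model(torch.rand(2, 100, 257), torch.rand(2, 100, 64))
```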
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space through a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
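The "lightweight projection" idea could look roughly like the following sketch, in which visual tokens are linearly mapped into the speech embedding space and concatenated with the audio tokens. All names and dimensions here are assumptions, not the paper's code.

```python
# Hypothetical sketch of projecting visual tokens into a speech model's embedding space.
import torch
import torch.nn as nn

class VisualToSpeechProjector(nn.Module):
    def __init__(self, visual_dim=768, speech_dim=512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, speech_dim)  # lightweight linear projection

    def forward(self, visual_tokens, speech_tokens):
        # visual_tokens: (B, T_v, visual_dim); speech_tokens: (B, T_a, speech_dim)
        v = self.proj(visual_tokens)
        return torch.cat([v, speech_tokens], dim=1)    # joint token sequence for the recognizer
```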
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with WERs of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
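An audio-guided cross-modal attention layer of the kind mentioned above might be sketched as follows, with audio frames as queries attending over visual features. This is a generic illustration under assumed dimensions, not the paper's CMFE implementation.

```python
# Hypothetical audio-guided cross-modal attention layer.
import torch
import torch.nn as nn

class AudioGuidedCrossAttention(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, T_a, dim) queries; visual: (B, T_v, dim) keys and values
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual fusion keeps the audio stream dominant

# Several such layers could be stacked at different encoder depths, e.g.
# out = AudioGuidedCrossAttention()(torch.rand(2, 120, 256), torch.rand(2, 30, 256))
```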
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
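A generic adapter-style sketch of injecting visual features into a frozen audio-only encoder is given below. The module names, dimensions, and fusion choices are assumptions and do not reproduce AVFormer's actual design; only the small projection and adapter would be trained.

```python
# Generic sketch: lightweight visual injection around a frozen speech encoder.
import torch
import torch.nn as nn

class VisualInjection(nn.Module):
    def __init__(self, visual_dim=768, model_dim=512, bottleneck=64):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, model_dim)       # visual features -> model space
        self.adapter = nn.Sequential(                             # small trainable bottleneck adapter
            nn.Linear(model_dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, model_dim))

    def forward(self, audio_hidden, visual_feats):
        # audio_hidden: (B, T_a, model_dim) from a frozen speech encoder
        # visual_feats: (B, T_v, visual_dim) from a frozen visual encoder
        tokens = torch.cat([self.visual_proj(visual_feats), audio_hidden], dim=1)
        return tokens + self.adapter(tokens)                      # residual adapter output
```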
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
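One way to approximate such intelligibility-oriented training is to replace a distance loss with a negative-STOI objective. The sketch below assumes the third-party torch_stoi package, which provides a differentiable STOI approximation; the paper uses its own modified STOI cost function, and the training function shown is hypothetical.

```python
# Rough sketch of intelligibility-oriented training (not the paper's code).
import torch
from torch_stoi import NegSTOILoss  # third-party package; assumed to be installed

loss_fn = NegSTOILoss(sample_rate=16000)        # lower value = higher estimated intelligibility

def training_step(model, noisy_wav, clean_wav):
    enhanced = model(noisy_wav)                 # model maps noisy waveform to enhanced waveform
    return loss_fn(enhanced, clean_wav).mean()  # backpropagate through the STOI-based loss
```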
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments [5.28539620288341]
AVSE methods use both audio and visual features for the task of speech enhancement.
To the best of our knowledge, there is no published study that has investigated which visual features are best suited for this specific task.
Our study shows that despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes their use difficult in low-resource systems.
arXiv Detail & Related papers (2020-11-09T11:48:14Z)
- Lite Audio-Visual Speech Enhancement [25.91075607254046]
Two problems may be encountered when implementing an audio-visual SE (AVSE) system: additional processing costs are incurred to incorporate visual input, and the use of face or lip images may raise privacy concerns.
We propose a Lite AVSE (LAVSE) system to address these problems.
arXiv Detail & Related papers (2020-05-24T15:09:42Z)
- How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
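The regularisation idea could be sketched as an auxiliary head that predicts Action Unit activations from the visual representations and adds a weighted term to the recognition loss. The class name, number of Action Units, and loss weight below are hypothetical, not taken from the paper.

```python
# Illustrative sketch of Action Unit prediction as an auxiliary regulariser.
import torch
import torch.nn as nn

class AUHead(nn.Module):
    def __init__(self, visual_dim=256, n_action_units=8):
        super().__init__()
        self.proj = nn.Linear(visual_dim, n_action_units)

    def forward(self, visual_feats):                    # (B, T, visual_dim)
        return self.proj(visual_feats)                  # per-frame AU activation logits

def total_loss(asr_loss, visual_feats, au_targets, au_head, weight=0.1):
    au_logits = au_head(visual_feats)
    au_loss = nn.functional.binary_cross_entropy_with_logits(au_logits, au_targets)
    return asr_loss + weight * au_loss                  # AU prediction regularises the visual branch
```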
arXiv Detail & Related papers (2020-04-17T13:59:19Z)