Lite Audio-Visual Speech Enhancement
- URL: http://arxiv.org/abs/2005.11769v3
- Date: Tue, 18 Aug 2020 13:33:54 GMT
- Title: Lite Audio-Visual Speech Enhancement
- Authors: Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo and Hsin-Min Wang
- Abstract summary: Two problems may be encountered when implementing an audio-visual SE (AVSE) system.
Additional processing costs are incurred to incorporate visual input.
The use of face or lip images may cause privacy problems.
We propose a Lite AVSE (LAVSE) system to address these problems.
- Score: 25.91075607254046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous studies have confirmed the effectiveness of incorporating visual
information into speech enhancement (SE) systems. Despite improved denoising
performance, two problems may be encountered when implementing an audio-visual
SE (AVSE) system: (1) additional processing costs are incurred to incorporate
visual input and (2) the use of face or lip images may cause privacy problems.
In this study, we propose a Lite AVSE (LAVSE) system to address these problems.
The system includes two visual data compression techniques and removes the
visual feature extraction network from the training model, yielding better
online computation efficiency. Our experimental results indicate that the
proposed LAVSE system can provide notably better performance than an audio-only
SE system with a similar number of model parameters. In addition, the
experimental results confirm the effectiveness of the two techniques for visual
data compression.
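To make the efficiency argument concrete, here is a minimal PyTorch sketch of the general recipe: compress lip images offline into compact, quantized codes, then run an enhancement network that consumes only noisy spectra plus those codes, so no visual feature extractor executes at inference time. The autoencoder layout, bit depth, and LSTM masking network below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): a lite AVSE model that consumes
# pre-compressed visual codes instead of raw lip images.
import torch
import torch.nn as nn

class VisualCompressor(nn.Module):
    """Offline autoencoder: compresses a lip image into a small latent code.
    Run once per frame during preprocessing, never at enhancement time."""
    def __init__(self, in_dim=64 * 64, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def compress(self, lip_frames, n_bits=8):
        code = self.encoder(lip_frames)
        # Crude uniform quantization to n_bits to shrink storage/transmission;
        # the paper's exact bit-reduction scheme may differ.
        scale = 2 ** n_bits - 1
        lo, hi = code.min(), code.max()
        q = torch.round((code - lo) / (hi - lo + 1e-8) * scale)
        return q / scale * (hi - lo) + lo

class LiteAVSE(nn.Module):
    """Enhancement net: audio log-power spectra + compact visual codes -> mask."""
    def __init__(self, n_freq=257, code_dim=32, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + code_dim, hidden, num_layers=2,
                           batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_spec, visual_code):
        x = torch.cat([noisy_spec, visual_code], dim=-1)  # (B, T, F + C)
        h, _ = self.rnn(x)
        mask = torch.sigmoid(self.out(h))
        return mask * noisy_spec  # enhanced spectrogram

# Usage: visual codes are computed offline, so inference runs no image network.
codes = VisualCompressor().compress(torch.randn(100, 64 * 64))  # (T, 32)
model = LiteAVSE()
enhanced = model(torch.randn(1, 100, 257), codes.unsqueeze(0))
```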
Related papers
- Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish [0.0]
This paper studies how estimating specialized end-to-end systems for a specific person affects the quality of speech recognition.
Results comparable to the current state of the art were reached even when only a limited amount of data was available.
arXiv Detail & Related papers (2023-11-21T09:44:33Z)
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
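As a rough illustration of the idea (not the authors' implementation), re-speeding both modalities could be done by resampling the waveform and the video frame index grid with independently drawn factors; the speed range and interpolation modes below are assumptions.

```python
# Minimal sketch (assumed details): randomly re-speed audio and video.
import torch
import torch.nn.functional as F

def speed_co_augment(wave, frames, low=0.5, high=2.0):
    """wave: (1, num_samples); frames: (T, C, H, W).
    Draws independent playback speeds for the two modalities, as in
    speed co-augmentation; the range [low, high] is an assumption."""
    a_speed = torch.empty(1).uniform_(low, high).item()
    v_speed = torch.empty(1).uniform_(low, high).item()
    # Re-speed audio by linear resampling of the waveform.
    n = max(1, int(wave.shape[-1] / a_speed))
    wave_aug = F.interpolate(wave.unsqueeze(0), size=n, mode="linear",
                             align_corners=False).squeeze(0)
    # Re-speed video by resampling the frame index grid.
    t = max(1, int(frames.shape[0] / v_speed))
    idx = torch.linspace(0, frames.shape[0] - 1, t).round().long()
    return wave_aug, frames[idx], (a_speed, v_speed)

wave_aug, frames_aug, speeds = speed_co_augment(
    torch.randn(1, 16000), torch.randn(25, 3, 96, 96))
```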
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
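A loose PyTorch sketch of this recipe, with placeholder layer sizes and a stand-in encoder rather than the actual AVFormer architecture, might look as follows: the speech model stays frozen, while a small visual projection and bottleneck adapters are the only trainable parts.

```python
# Loose sketch (placeholder sizes): inject visual tokens into a frozen
# audio encoder, training only the projection and bottleneck adapters.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter with a residual connection."""
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                 nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)

class VisionInjectedASR(nn.Module):
    def __init__(self, frozen_encoder, dim=512, vis_dim=768):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():   # speech model stays frozen
            p.requires_grad = False
        self.visual_proj = nn.Linear(vis_dim, dim)  # trainable
        self.adapter = Adapter(dim)                 # trainable

    def forward(self, audio_feats, visual_feats):
        vis_tokens = self.visual_proj(visual_feats)      # (B, Tv, D)
        x = torch.cat([vis_tokens, audio_feats], dim=1)  # prepend visual tokens
        x = self.encoder(x)
        return self.adapter(x)

# Usage with a stand-in frozen encoder (a real system would use a
# pretrained speech model such as a Conformer).
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), 2)
model = VisionInjectedASR(enc)
out = model(torch.randn(2, 100, 512), torch.randn(2, 4, 768))
```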
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Egocentric Audio-Visual Noise Suppression [11.113020254726292]
This paper studies audio-visual noise suppression for egocentric videos.
The video camera captures the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
arXiv Detail & Related papers (2022-11-07T15:53:12Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed intelligibility-oriented (I-O) AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
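Since standard STOI is not differentiable out of the box, training on it requires a differentiable approximation; a minimal sketch of such a setup, assuming the third-party torch_stoi package and a placeholder convolutional model rather than the paper's exact network, could look like this.

```python
# Minimal sketch: train an SE model with a negative-STOI training objective,
# using the third-party torch_stoi package as one available differentiable
# approximation (an assumption, not the paper's exact code).
import torch
import torch.nn as nn
from torch_stoi import NegSTOILoss  # pip install torch_stoi

model = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                      nn.Conv1d(16, 1, 9, padding=4))  # placeholder SE net
loss_fn = NegSTOILoss(sample_rate=16000)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

noisy = torch.randn(4, 1, 32000)   # stand-in batch of noisy waveforms
clean = torch.randn(4, 1, 32000)   # matching clean references

enhanced = model(noisy)
# Minimizing negative STOI maximizes intelligibility of the output.
loss = loss_fn(enhanced.squeeze(1), clean.squeeze(1)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```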
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments [5.28539620288341]
AVSE methods use both audio and visual features for the task of speech enhancement.
To the best of our knowledge, there is no published study that has investigated which visual features are best suited for this specific task.
Our study shows that despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes their use difficult in low-resource systems.
arXiv Detail & Related papers (2020-11-09T11:48:14Z)
- Improved Lite Audio-Visual Speech Enhancement [27.53117725152492]
We propose a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario.
In this study, we extend LAVSE to improved LAVSE (iLAVSE) to better address three practical issues often encountered in implementing AVSE systems.
We evaluate iLAVSE on the Taiwan Mandarin Speech with Video (TMSV) dataset.
arXiv Detail & Related papers (2020-08-30T17:29:19Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.