An Empirical Study of Visual Features for DNN based Audio-Visual Speech
Enhancement in Multi-talker Environments
- URL: http://arxiv.org/abs/2011.04359v1
- Date: Mon, 9 Nov 2020 11:48:14 GMT
- Title: An Empirical Study of Visual Features for DNN based Audio-Visual Speech
Enhancement in Multi-talker Environments
- Authors: Shrishti Saha Shetu, Soumitro Chakrabarty and Emanuël A. P. Habets
- Abstract summary: AVSE methods use both audio and visual features for the task of speech enhancement.
To the best of our knowledge, there is no published study that has investigated which visual features are best suited for this specific task.
Our study shows that despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes their use difficult in low-resource systems.
- Score: 5.28539620288341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual speech enhancement (AVSE) methods use both audio and visual
features for the task of speech enhancement and the use of visual features has
been shown to be particularly effective in multi-speaker scenarios. In the
majority of deep neural network (DNN) based AVSE methods, the audio and visual
data are first processed separately using different sub-networks, and then the
learned features are fused to utilize the information from both modalities.
There have been various studies on suitable audio input features and network
architectures; however, to the best of our knowledge, there is no published
study that has investigated which visual features are best suited for this
specific task. In this work, we perform an empirical study of the most
commonly used visual features for DNN-based AVSE and the pre-processing
requirements for each of these features, and investigate their influence on
performance. Our study shows that despite the overall better performance of
embedding-based features, their computationally intensive pre-processing makes
their use difficult in low-resource systems. For such systems, optical flow or
raw pixel-based features might be better suited.
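To make the fusion pattern described in the abstract concrete, below is a
minimal PyTorch sketch of a two-branch AVSE model: separate audio and visual
sub-networks whose features are fused to estimate a time-frequency mask. The
layer sizes, the recurrent fusion, the mask-based output, and the visual input
dimension are illustrative assumptions, not the architecture evaluated in the
paper.

    import torch
    import torch.nn as nn

    class AudioVisualEnhancer(nn.Module):
        """Two-branch AVSE sketch: separate encoders, fusion, mask estimation."""

        def __init__(self, n_freq_bins=257, visual_dim=512, hidden_dim=256):
            super().__init__()
            # Audio sub-network: encodes noisy magnitude-spectrogram frames.
            self.audio_net = nn.Sequential(
                nn.Linear(n_freq_bins, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            # Visual sub-network: encodes per-frame visual features, e.g. a lip
            # embedding, flattened raw lip pixels, or optical-flow values.
            self.visual_net = nn.Sequential(
                nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
            )
            # Fuse the concatenated modality features over time.
            self.fusion = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
            # Estimate a time-frequency mask for the noisy spectrogram.
            self.mask_head = nn.Sequential(
                nn.Linear(hidden_dim, n_freq_bins), nn.Sigmoid(),
            )

        def forward(self, noisy_spec, visual_feats):
            # noisy_spec:   (batch, time, n_freq_bins)
            # visual_feats: (batch, time, visual_dim), aligned to the audio frame rate
            a = self.audio_net(noisy_spec)
            v = self.visual_net(visual_feats)
            fused, _ = self.fusion(torch.cat([a, v], dim=-1))
            mask = self.mask_head(fused)
            return mask * noisy_spec  # enhanced magnitude spectrogram

In this sketch the trade-off discussed above appears only in how visual_feats
is produced: embedding-based features require an additional, computationally
heavier embedding extractor at pre-processing time, whereas raw lip pixels or
optical flow can be flattened and fed to the visual sub-network directly.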
Related papers
- Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning [3.7161123856095837]
This paper addresses the problem of self-supervised general-purpose audio representation learning.
We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations.
arXiv Detail & Related papers (2024-05-14T15:00:09Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Improved Lite Audio-Visual Speech Enhancement [27.53117725152492]
We propose a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario.
In this study, we extend LAVSE to address three practical issues often encountered when implementing AVSE systems.
We evaluate the extended system, iLAVSE, on the Taiwan Mandarin speech with video dataset.
arXiv Detail & Related papers (2020-08-30T17:29:19Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Lite Audio-Visual Speech Enhancement [25.91075607254046]
Two problems may be encountered when implementing an audio-visual SE (AVSE) system: additional processing costs are incurred to incorporate visual input, and the use of face or lip images may raise privacy concerns.
We propose a Lite AVSE (LAVSE) system to address these problems.
arXiv Detail & Related papers (2020-05-24T15:09:42Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.