An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation
- URL: http://arxiv.org/abs/2008.09586v2
- Date: Fri, 12 Mar 2021 22:27:01 GMT
- Title: An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation
- Authors: Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu,
Dong Yu, and Jesper Jensen
- Abstract summary: Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
- Score: 57.68765353264689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
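To make the pipeline outlined in the abstract concrete (acoustic features, visual features, a fusion step, and a training target), the following is a minimal illustrative sketch in PyTorch. It assumes per-frame magnitude spectra as acoustic features, precomputed lip-region embeddings as visual features, concatenation followed by a recurrent layer as the fusion mechanism, and a time-frequency mask as the training target; all module names and dimensions are assumptions for illustration, not a specific system from the survey.
```python
# Minimal sketch of concatenation-based audio-visual fusion with a
# time-frequency mask as the training target. All names and sizes are
# illustrative assumptions, not a specific system from the survey.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)        # acoustic feature encoder
        self.visual_proj = nn.Linear(visual_dim, hidden)   # visual (lip) feature encoder
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)  # fuse concatenated streams over time
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, lip_emb):
        # noisy_mag: (batch, time, n_freq) magnitude spectrogram of the mixture
        # lip_emb:   (batch, time, visual_dim) lip-region embeddings at the same frame rate
        a = self.audio_proj(noisy_mag)
        v = self.visual_proj(lip_emb)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)          # mask values in [0, 1], an ideal-ratio-mask-style target
        return mask * noisy_mag               # estimate of the target speaker's magnitude

# A common objective is the mean squared error to the clean magnitude:
# loss = torch.nn.functional.mse_loss(model(noisy_mag, lip_emb), clean_mag)
```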
Related papers
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with that conveyed by the lip movements (a generic version of such a loss is sketched below).
arXiv Detail & Related papers (2024-04-19T09:08:44Z)
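The contrastive semantic matching idea above can be illustrated with a generic InfoNCE-style loss that pulls matched speech/lip embedding pairs together and pushes mismatched pairs apart. This is a hedged sketch that assumes precomputed per-utterance embeddings; it is not the paper's exact formulation, and the function name and temperature value are illustrative.
```python
# Generic InfoNCE-style contrastive matching loss between speech and lip
# embeddings; a sketch of the idea, not the paper's exact loss.
import torch
import torch.nn.functional as F

def contrastive_semantic_matching_loss(speech_emb, lip_emb, temperature=0.07):
    # speech_emb, lip_emb: (batch, dim); row i of each is a matched pair
    s = F.normalize(speech_emb, dim=-1)
    l = F.normalize(lip_emb, dim=-1)
    logits = s @ l.t() / temperature                       # (batch, batch) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)     # positives lie on the diagonal
    # Symmetric cross-entropy over speech-to-lip and lip-to-speech directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```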
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation, caused by sound reflecting off surfaces and objects in the environment, not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time (a generic audio-guided attention step is sketched after this entry).
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
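The audio-guided localization idea above can be illustrated with a generic attention step in which an audio embedding queries a per-frame visual feature map; the optical-flow aggregation over time is omitted. All names, shapes, and the scaling choice are assumptions, not the paper's architecture.
```python
# Generic audio-guided spatial attention over a visual feature map, as a
# sketch of sound-source localization; shapes and names are assumptions.
import torch
import torch.nn as nn

class AudioGuidedAttention(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, attn_dim=128):
        super().__init__()
        self.query = nn.Linear(audio_dim, attn_dim)                  # audio embedding -> query
        self.key = nn.Conv2d(visual_dim, attn_dim, kernel_size=1)    # visual map -> per-location keys

    def forward(self, audio_emb, visual_map):
        # audio_emb: (batch, audio_dim); visual_map: (batch, visual_dim, H, W)
        q = self.query(audio_emb)                                    # (batch, attn_dim)
        k = self.key(visual_map)                                     # (batch, attn_dim, H, W)
        b, d, h, w = k.shape
        scores = torch.einsum('bd,bdhw->bhw', q, k) / d ** 0.5       # scaled dot-product scores
        attn = scores.flatten(1).softmax(dim=-1).view(b, h, w)       # spatial map of likely source locations
        pooled = torch.einsum('bhw,bchw->bc', attn, visual_map)      # attention-pooled object embedding
        return attn, pooled
```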
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)