Audio-Visual Speech Inpainting with Deep Learning
- URL: http://arxiv.org/abs/2010.04556v2
- Date: Wed, 3 Feb 2021 11:36:56 GMT
- Title: Audio-Visual Speech Inpainting with Deep Learning
- Authors: Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen
- Abstract summary: We inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration.
Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large.
We show that multi-task learning is effective, although the largest contribution to performance comes from vision.
- Score: 30.59696039318939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a deep-learning-based framework for audio-visual
speech inpainting, i.e., the task of restoring the missing parts of an acoustic
speech signal from reliable audio context and uncorrupted visual information.
Recent work focuses solely on audio-only methods and generally aims at
inpainting music signals, which exhibit a structure very different from that of
speech.
Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to
investigate the contribution that vision can provide for gaps of different
duration. We also experiment with a multi-task learning approach where a phone
recognition task is learned together with speech inpainting. Results show that
the performance of audio-only speech inpainting approaches degrades rapidly
when gaps get large, while the proposed audio-visual approach is able to
plausibly restore missing information. In addition, we show that multi-task
learning is effective, although the largest contribution to performance comes
from vision.
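
To make the setup concrete, below is a minimal PyTorch-style sketch of the kind of model the abstract describes: a recurrent network that receives the corrupted spectrogram, a binary gap mask and per-frame visual features, reconstructs the missing frames, and jointly learns an auxiliary phone-recognition task. The feature dimensions, the BLSTM backbone, the CTC auxiliary term and the loss weight are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of audio-visual speech inpainting with multi-task learning.
# All sizes and the CTC auxiliary loss are assumptions for illustration only.
import torch
import torch.nn as nn


class AVInpainter(nn.Module):
    def __init__(self, n_freq=257, n_video=68, n_phones=42, hidden=256):
        super().__init__()
        # Audio frames, the gap mask and visual features are concatenated per frame.
        self.blstm = nn.LSTM(
            input_size=n_freq + 1 + n_video,
            hidden_size=hidden,
            num_layers=3,
            batch_first=True,
            bidirectional=True,
        )
        self.audio_head = nn.Linear(2 * hidden, n_freq)    # reconstructed frames
        self.phone_head = nn.Linear(2 * hidden, n_phones)  # CTC logits (incl. blank)

    def forward(self, spec, mask, video):
        # spec:  (B, T, n_freq)  corrupted log-magnitude spectrogram
        # mask:  (B, T, 1)       1 inside the gap, 0 elsewhere
        # video: (B, T, n_video) per-frame visual features (e.g. lip landmarks)
        x, _ = self.blstm(torch.cat([spec, mask, video], dim=-1))
        return self.audio_head(x), self.phone_head(x).log_softmax(-1)


def multitask_loss(pred_spec, clean_spec, mask, phone_logprobs,
                   phone_targets, in_lens, tgt_lens, alpha=0.1):
    # Reconstruction is penalised only inside the gap; the phone-recognition
    # task enters as an auxiliary CTC term with a small (assumed) weight.
    recon = ((pred_spec - clean_spec) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    ctc = nn.functional.ctc_loss(
        phone_logprobs.transpose(0, 1), phone_targets, in_lens, tgt_lens
    )
    return recon + alpha * ctc
```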
Related papers
- Deep Video Inpainting Guided by Audio-Visual Self-Supervision [25.841796702924444]
Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events.
In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting.
arXiv Detail & Related papers (2023-10-11T17:03:21Z)
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video.
arXiv Detail & Related papers (2023-04-12T04:17:45Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.