Speech inpainting: Context-based speech synthesis guided by video
- URL: http://arxiv.org/abs/2306.00489v1
- Date: Thu, 1 Jun 2023 09:40:47 GMT
- Title: Speech inpainting: Context-based speech synthesis guided by video
- Authors: Juan F. Montesinos and Daniel Michelsanti and Gloria Haro and
Zheng-Hua Tan and Jesper Jensen
- Abstract summary: This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
- Score: 29.233167442719676
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio and visual modalities are inherently connected in speech signals: lip
movements and facial expressions are correlated with speech sounds. This
motivates studies that incorporate the visual modality to enhance an acoustic
speech signal or even restore missing audio information. Specifically, this
paper focuses on the problem of audio-visual speech inpainting, which is the
task of synthesizing the speech in a corrupted audio segment in a way that is
consistent with the corresponding visual content and the uncorrupted audio
context. We present an audio-visual transformer-based deep learning model that
leverages visual cues that provide information about the content of the
corrupted audio. It outperforms the previous state-of-the-art audio-visual
model and audio-only baselines. We also show how visual features extracted with
AV-HuBERT, a large audio-visual transformer for speech recognition, are
suitable for synthesizing speech.
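As a rough, hypothetical illustration of the recipe the abstract describes (a transformer conditioned on the uncorrupted audio context plus per-frame visual features, regressing the spectrogram frames of the corrupted segment), the sketch below may help. It is not the authors' implementation: the module, the additive fusion, and the dimensions (for example, a 768-dimensional visual feature, as a base AV-HuBERT encoder would produce) are assumptions made for illustration, and extracting the visual features themselves is taken as given.

```python
# Hypothetical sketch only: a transformer that sees uncorrupted audio context and
# per-frame visual features, and regresses the mel frames of the corrupted segment.
import torch
import torch.nn as nn

class AVSpeechInpainter(nn.Module):
    def __init__(self, n_mels=80, d_visual=768, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)          # embed audio (mel) frames
        self.visual_proj = nn.Linear(d_visual, d_model)       # embed visual features
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # stands in for corrupted frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_mels)                # regress mel frames

    def forward(self, mels, visual, corrupted):
        # mels: (B, T, n_mels); visual: (B, T, d_visual); corrupted: (B, T) bool mask
        x = self.audio_proj(mels)
        x = torch.where(corrupted.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.visual_proj(visual)                      # simple additive fusion (one option)
        return self.head(self.encoder(x))

# usage: reconstruction loss computed only on the corrupted region
model = AVSpeechInpainter()
mels = torch.randn(2, 200, 80)                    # clean mels used as a stand-in target
visual = torch.randn(2, 200, 768)                 # e.g. precomputed AV-HuBERT features
corrupted = torch.zeros(2, 200, dtype=torch.bool)
corrupted[:, 80:120] = True                       # mark a missing segment
pred = model(mels, visual, corrupted)
loss = ((pred - mels) ** 2)[corrupted].mean()
```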
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. A rough sketch of this two-stage spectrogram-then-vocoder recipe is given after this list.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on the machine translation metrics used for caption evaluation.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- AudioViewer: Learning to Visualize Sound [12.71759722609666]
We aim to create sound perception for hearing-impaired people, for instance to facilitate feedback during speech training for deaf users.
Our design is to translate from audio to video by compressing both into a common latent space with shared structure.
arXiv Detail & Related papers (2020-12-22T21:52:45Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
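As referenced in the LA-VocE entry above, here is a rough sketch of the generic two-stage recipe it describes: a first stage that predicts clean mel-spectrograms from noisy audio-visual input, and a second stage that converts mel-spectrograms into a waveform. Everything here is an illustrative assumption rather than the published system: the stage-1 model is a toy transformer, the visual feature dimension is arbitrary, and Griffin-Lim is used only as a runnable stand-in for the neural vocoder that such systems actually use.

```python
# Hypothetical two-stage sketch: (1) predict clean mels from noisy audio-visual input,
# (2) turn mels into a waveform. Griffin-Lim replaces the neural vocoder to stay runnable.
import torch
import torch.nn as nn
import torchaudio

class Stage1Enhancer(nn.Module):
    """Toy stage 1: noisy mels + per-frame visual features -> clean mel-spectrogram."""
    def __init__(self, n_mels=80, d_visual=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels + d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mels, visual):                 # (B, T, n_mels), (B, T, d_visual)
        x = self.proj(torch.cat([noisy_mels, visual], dim=-1))
        return self.head(self.encoder(x))                  # (B, T, n_mels)

# stage 2: mel -> linear magnitude spectrogram -> waveform
inv_mel = torchaudio.transforms.InverseMelScale(n_stft=201, n_mels=80, sample_rate=16000)
vocoder_stand_in = torchaudio.transforms.GriffinLim(n_fft=400)

enhancer = Stage1Enhancer()
noisy_mels = torch.rand(1, 300, 80)                        # (batch, frames, mels)
visual = torch.randn(1, 300, 512)                          # per-frame visual features
clean_mels = enhancer(noisy_mels, visual).clamp(min=0)     # magnitudes must be non-negative
spec = inv_mel(clean_mels.squeeze(0).T)                    # (80, 300) -> (201, 300)
waveform = vocoder_stand_in(spec)                          # (num_samples,)
```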
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.