Deep Video Inpainting Guided by Audio-Visual Self-Supervision
- URL: http://arxiv.org/abs/2310.07663v1
- Date: Wed, 11 Oct 2023 17:03:21 GMT
- Title: Deep Video Inpainting Guided by Audio-Visual Self-Supervision
- Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
- Abstract summary: Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events.
In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting.
- Score: 25.841796702924444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans can easily imagine a scene from auditory information based on their
prior knowledge of audio-visual events. In this paper, we mimic this innate
human ability in deep learning models to improve the quality of video
inpainting. To implement the prior knowledge, we first train an audio-visual
network that learns the correspondence between auditory and visual
information. The audio-visual network is then employed as a guide that
conveys this prior knowledge of audio-visual correspondence to the video
inpainting network. The knowledge is transferred through two novel losses we
propose: an audio-visual attention loss and an audio-visual pseudo-class
consistency loss. These losses further improve video inpainting performance
by encouraging the inpainted result to correspond closely to its synchronized
audio. Experimental results demonstrate that our method can restore a wider
range of video scenes and is particularly effective when the sounding object
in the scene is partially occluded.
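The two guidance losses are only named in the abstract; the sketch below is a minimal, hypothetical PyTorch rendering of how such guidance terms could look, assuming the pretrained audio-visual network exposes an attention map and pseudo-class logits for each (frame, audio) pair. The L1 and KL formulations, function names, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two guidance losses described in the abstract.
# Assumes a frozen, pretrained audio-visual network that outputs an attention
# map and pseudo-class logits for a (frame, audio) pair; shapes and the exact
# distance measures are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def audio_visual_attention_loss(attn_inpainted, attn_original):
    """Assumed formulation: L1 distance between the audio-visual attention
    maps computed for the inpainted and the original (ground-truth) frames."""
    return F.l1_loss(attn_inpainted, attn_original)


def pseudo_class_consistency_loss(logits_inpainted, logits_original, tau=1.0):
    """Assumed formulation: KL divergence between the pseudo-class
    distributions the audio-visual network predicts for the inpainted and
    the original frames (original treated as a fixed target)."""
    p_orig = F.softmax(logits_original.detach() / tau, dim=-1)
    log_p_inp = F.log_softmax(logits_inpainted / tau, dim=-1)
    return F.kl_div(log_p_inp, p_orig, reduction="batchmean")


# Toy usage: random tensors stand in for the audio-visual network's outputs.
attn_inp, attn_org = torch.rand(2, 1, 14, 14), torch.rand(2, 1, 14, 14)
logits_inp, logits_org = torch.randn(2, 10), torch.randn(2, 10)
total_guidance = (audio_visual_attention_loss(attn_inp, attn_org)
                  + pseudo_class_consistency_loss(logits_inp, logits_org))
print(total_guidance.item())
```

In such a setup, the guidance terms would be added to the usual reconstruction and adversarial losses of the inpainting network, with the audio-visual network kept frozen so it acts purely as a source of prior knowledge.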
Related papers
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection [2.985620880452743]
We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method for improved deepfake detection.
To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual masking and feature fusion strategy.
We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
arXiv Detail & Related papers (2024-06-05T05:20:12Z) - DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided
Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z) - Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the missing speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z) - Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Audio-Visual Speech Inpainting with Deep Learning [30.59696039318939]
We inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration.
Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large.
We show that multi-task learning is effective, although the largest contribution to performance comes from vision.
arXiv Detail & Related papers (2020-10-09T13:23:01Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)