Generating Visually Aligned Sound from Videos
- URL: http://arxiv.org/abs/2008.00820v1
- Date: Tue, 14 Jul 2020 07:51:06 GMT
- Title: Generating Visually Aligned Sound from Videos
- Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang,
Chuang Gan
- Abstract summary: We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside the camera's view cannot be inferred from video content.
- Score: 83.89485254543888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on the task of generating sound from natural videos, and the sound
should be both temporally and content-wise aligned with visual signals. This
task is extremely challenging because some sounds generated outside the
camera's view cannot be inferred from the video content. The model may be forced to learn
an incorrect mapping between visual content and these irrelevant sounds. To
address this challenge, we propose a framework named REGNET. In this framework,
we first extract appearance and motion features from video frames to better
distinguish the object that emits sound from complex background information. We
then introduce an innovative audio forwarding regularizer that directly
considers the real sound as input and outputs bottlenecked sound features.
Using both visual and bottlenecked sound features for sound prediction during
training provides stronger supervision for the sound prediction. The audio
forwarding regularizer can control the irrelevant sound component and thus
prevent the model from learning an incorrect mapping between video frames and
sound emitted by off-screen objects. During testing, the
audio forwarding regularizer is removed to ensure that REGNET can produce
purely aligned sound only from visual features. Extensive evaluations based on
Amazon Mechanical Turk demonstrate that our method significantly improves both
temporal and content-wise alignment. Remarkably, our generated sound can fool
human listeners with a 68.12% success rate. Code and pre-trained models are publicly
available at https://github.com/PeihaoChen/regnet
Related papers
- Self-Supervised Audio-Visual Soundscape Stylization [22.734359700809126]
We manipulate input speech to sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene.
Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures.
We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
arXiv Detail & Related papers (2024-09-22T06:57:33Z) - Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z) - Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to faithfully focus video-to-audio generation on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders transferring these techniques from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - An Initial Exploration: Learning to Generate Realistic Audio for Silent
Video [0.0]
We develop a framework that observes video in its natural sequence and generates realistic audio to accompany it.
Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs.
We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively.
arXiv Detail & Related papers (2023-08-23T20:08:56Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that schedules the learning procedure of each component to associate the audio and visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Visual Sound Localization in the Wild by Cross-Modal Interference
Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.