VarietySound: Timbre-Controllable Video to Sound Generation via
Unsupervised Information Disentanglement
- URL: http://arxiv.org/abs/2211.10666v1
- Date: Sat, 19 Nov 2022 11:12:01 GMT
- Title: VarietySound: Timbre-Controllable Video to Sound Generation via
Unsupervised Information Disentanglement
- Authors: Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao
- Abstract summary: We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle the target audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples that are well synchronized with the events in the video and have high timbre similarity with the reference audio.
- Score: 68.42632589736881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-sound generation aims to generate realistic and natural sound given
a video input. However, previous video-to-sound generation methods can only
generate a random or average timbre, without any control over or specialization of
the generated sound's timbre, so users sometimes cannot obtain the desired
timbre with these methods. In this paper, we pose the task of generating sound
with a specific timbre given a video input and a reference audio sample. To
solve this task, we disentangle the target audio into three components: temporal
information, acoustic information, and background information.
background information. We first use three encoders to encode these components
respectively: 1) a temporal encoder to encode temporal information, which is
fed with video frames since the input video shares the same temporal
information as the original audio; 2) an acoustic encoder to encode timbre
information, which takes the original audio as input and discards its temporal
information by a temporal-corrupting operation; and 3) a background encoder to
encode the residual or background sound, which uses the background part of the
original audio as input. To improve the quality and temporal alignment of the
generated results, we also adopt a mel discriminator and a temporal
discriminator for adversarial training. Our experimental results on the VAS
dataset demonstrate that our method can generate high-quality audio samples
that are well synchronized with the events in the video and have high timbre
similarity with the reference audio.
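
To make the three-encoder design described in the abstract concrete, here is a minimal PyTorch sketch of how the disentanglement could be wired up. All module choices, dimensions, and the frame-shuffling stand-in for the temporal-corrupting operation are illustrative assumptions rather than the authors' implementation; the mel and temporal discriminators used for adversarial training are omitted.

```python
import torch
import torch.nn as nn


class TemporalCorrupt(nn.Module):
    """Hypothetical stand-in for the temporal-corrupting operation:
    shuffles mel frames along time so only timbre statistics survive."""
    def forward(self, mel):                                   # mel: (B, T, n_mels)
        idx = torch.randperm(mel.size(1), device=mel.device)
        return mel[:, idx, :]


class VideoToSoundSketch(nn.Module):
    """Three encoders (temporal, acoustic, background) feeding one mel decoder."""
    def __init__(self, frame_dim=512, n_mels=80, hidden=256):
        super().__init__()
        self.temporal_enc = nn.GRU(frame_dim, hidden, batch_first=True)   # video frames -> timing
        self.acoustic_enc = nn.GRU(n_mels, hidden, batch_first=True)      # corrupted mel -> timbre
        self.background_enc = nn.GRU(n_mels, hidden, batch_first=True)    # background mel -> residual
        self.corrupt = TemporalCorrupt()
        self.decoder = nn.Linear(hidden * 3, n_mels)                      # fuse and predict mel frames

    def forward(self, video_frames, ref_mel, bg_mel):
        # Per-frame temporal features from the video (assumed time-aligned with the mel).
        t_seq, _ = self.temporal_enc(video_frames)                        # (B, T, H)
        # Global timbre embedding from the temporally corrupted reference audio.
        _, a_vec = self.acoustic_enc(self.corrupt(ref_mel))               # (1, B, H)
        # Global background embedding from the background part of the audio.
        _, b_vec = self.background_enc(bg_mel)                            # (1, B, H)
        T = t_seq.size(1)
        a = a_vec.transpose(0, 1).expand(-1, T, -1)                       # broadcast timbre over time
        b = b_vec.transpose(0, 1).expand(-1, T, -1)                       # broadcast background over time
        return self.decoder(torch.cat([t_seq, a, b], dim=-1))             # (B, T, n_mels)


# Example usage with dummy tensors (batch of 2, 100 time steps).
model = VideoToSoundSketch()
mel = model(torch.randn(2, 100, 512), torch.randn(2, 100, 80), torch.randn(2, 100, 80))
print(mel.shape)  # torch.Size([2, 100, 80])
```

In a real system, the per-frame video features would need to be brought to the mel frame rate (e.g., by upsampling) before fusion; the sketch simply assumes the video and audio sequences are already time-aligned.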
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose decoupled audio-video dependence modeling that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion [23.398304611826642]
We propose The Power of Sound model to incorporate audio input that includes both changeable temporal semantics and magnitude.
To generate video frames, TPoS utilizes a latent stable diffusion model with semantic information, which is then guided by the sequential audio embedding.
We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation.
arXiv Detail & Related papers (2023-09-08T12:21:01Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task (see the weight-transfer sketch after this list).
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds produced outside of the camera's view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
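
The decoder-initialization step mentioned in the "Large-scale unsupervised audio pre-training for video-to-speech synthesis" entry above boils down to standard weight transfer: pre-train an audio decoder on audio-only data, then load its weights into the video-to-speech model before fine-tuning. The sketch below is a generic, hypothetical PyTorch illustration of that step; the AudioDecoder module, its dimensions, and the checkpoint path are assumptions, not that paper's architecture.

```python
import torch
import torch.nn as nn


# Illustrative decoder: any module whose weights were learned during
# audio-only pre-training and can be reused for video-to-speech synthesis.
class AudioDecoder(nn.Module):
    def __init__(self, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, x):
        y, _ = self.rnn(x)
        return self.proj(y)


# 1) Pre-train the decoder on a large audio-only corpus, then save it.
pretrained = AudioDecoder()
torch.save(pretrained.state_dict(), "audio_decoder_pretrained.pt")  # hypothetical path

# 2) In the video-to-speech model, initialize the decoder from the checkpoint
#    instead of from random weights, then fine-tune end to end.
decoder = AudioDecoder()
decoder.load_state_dict(torch.load("audio_decoder_pretrained.pt"))
```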