Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
- URL: http://arxiv.org/abs/2303.16541v1
- Date: Wed, 29 Mar 2023 09:07:31 GMT
- Title: Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
- Authors: Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, Jing Liu
- Abstract summary: Sounding Video Generator (SVG) is a unified framework for generating realistic videos along with audio signals.
SVG-VQGAN transforms visual frames and audio mel-spectrograms into discrete tokens.
A Transformer-based decoder is used to model associations between texts, visual frames, and audio signals.
- Score: 24.403772976932487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a combination of visual and audio signals, video is inherently
multi-modal. However, existing video generation methods are primarily intended
for the synthesis of visual frames, whereas audio signals in realistic videos
are disregarded. In this work, we concentrate on a rarely investigated problem
of text-guided sounding video generation and propose the Sounding Video
Generator (SVG), a unified framework for generating realistic videos along with
audio signals. Specifically, we present the SVG-VQGAN to transform visual
frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a
novel hybrid contrastive learning method to model inter-modal and intra-modal
consistency and improve the quantized representations. A cross-modal attention
module is employed to extract associated features of visual frames and audio
signals for contrastive learning. Then, a Transformer-based decoder is used to
model associations between texts, visual frames, and audio signals at token
level for auto-regressive sounding video generation. AudioSetCap, a
human-annotated text-video-audio paired dataset, is produced for training SVG.
Experimental results demonstrate the superiority of our method when compared
with existing text-to-video generation methods as well as audio generation
methods on Kinetics and VAS datasets.
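As a rough illustration of the token-level, auto-regressive stage described in the abstract, the following minimal sketch assumes that visual frames and audio mel-spectrograms have already been quantized into discrete token ids by a VQGAN-style tokenizer, and models the concatenated [text | visual | audio] sequence with a causally masked Transformer. All class, function, and hyper-parameter names are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of token-level auto-regressive modeling over text, visual,
# and audio tokens (not the released SVG code). Visual/audio ids are assumed to
# come from a VQGAN-style tokenizer; vocabulary sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundingVideoDecoder(nn.Module):
    def __init__(self, text_vocab=30000, visual_vocab=8192, audio_vocab=8192,
                 d_model=512, n_layers=8, n_heads=8, max_len=1024):
        super().__init__()
        # One shared vocabulary: text ids, then visual ids, then audio ids.
        self.visual_offset = text_vocab
        self.audio_offset = text_vocab + visual_vocab
        vocab = text_vocab + visual_vocab + audio_vocab
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) ids drawn from the shared vocabulary
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)[None]
        # Causal mask: each position may only attend to earlier tokens.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))

def training_step(model, text_ids, visual_ids, audio_ids):
    """Next-token prediction over the concatenated [text | visual | audio] sequence."""
    seq = torch.cat([text_ids,
                     visual_ids + model.visual_offset,
                     audio_ids + model.audio_offset], dim=1)
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))
```
At inference time, video and audio tokens would be sampled one at a time after the text prompt and decoded back to frames and mel-spectrograms by the tokenizer's decoders, following the auto-regressive scheme the abstract describes.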
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Synthesizing Audio from Silent Video using Sequence to Sequence Modeling [0.0]
We propose a novel method to generate audio from video using a sequence-to-sequence model.
Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures.
Our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
arXiv Detail & Related papers (2024-04-25T22:19:42Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a contrastive language-image pretraining (CLIP) model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
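As a loose illustration of conditioning audio generation on a frame embedding, the sketch below shows a denoising-diffusion training step in which a toy MLP predicts the noise added to a flattened mel-spectrogram given an image embedding; the denoiser, dimensions, and noise schedule are placeholder assumptions and do not reflect CLIPSonic's actual architecture.
```python
# Hypothetical sketch (not CLIPSonic's implementation): one diffusion training step
# for a noise predictor conditioned on a frame embedding (e.g. from a CLIP-style
# image encoder). The MLP denoiser and all dimensions are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    """Toy denoiser: maps (noisy mel, timestep, condition) to predicted noise."""
    def __init__(self, mel_dim=80 * 64, cond_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_dim))

    def forward(self, noisy_mel, t, cond):
        t_feat = t.float().unsqueeze(-1) / 1000.0   # crude timestep encoding
        return self.net(torch.cat([noisy_mel, cond, t_feat], dim=-1))

def diffusion_training_step(model, mel, frame_emb, T=1000):
    """Add noise to the mel-spectrogram at a random timestep and predict that noise."""
    betas = torch.linspace(1e-4, 0.02, T, device=mel.device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (mel.size(0),), device=mel.device)
    noise = torch.randn_like(mel)
    a_bar = alphas_bar[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, t, frame_emb), noise)
```
At test time, a text embedding from the shared language-image space could stand in for the frame embedding, which matches the summary's idea of using the visual modality as a bridge between text and audio.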
- DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
arXiv Detail & Related papers (2023-05-22T10:37:27Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
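For intuition on what learning a shared audio-visual embedding space involves, here is a minimal, generic sketch of a symmetric InfoNCE-style contrastive objective over paired audio and visual clip embeddings; AVLnet's own encoders and training loss differ, so treat the function name and temperature value as assumptions.
```python
# Generic sketch of a shared audio-visual embedding objective (not AVLnet's loss):
# paired audio/visual embeddings are pulled together, mismatched pairs pushed apart.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (batch, dim) embeddings of temporally aligned clips."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # True pairs sit on the diagonal; contrast audio->visual and visual->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```
A loss of this general form is also the usual way to encourage the kind of inter-modal consistency mentioned in the SVG abstract above, though the paper's hybrid contrastive method adds intra-modal terms as well.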
This list is automatically generated from the titles and abstracts of the papers on this site.