Recurrent Deconvolutional Generative Adversarial Networks with
Application to Text Guided Video Generation
- URL: http://arxiv.org/abs/2008.05856v1
- Date: Thu, 13 Aug 2020 12:22:27 GMT
- Title: Recurrent Deconvolutional Generative Adversarial Networks with
Application to Text Guided Video Generation
- Authors: Hongyuan Yu, Yan Huang, Lihong Pi, Liang Wang
- Abstract summary: We propose a recurrent deconvolutional generative adversarial network (RD-GAN), which includes a recurrent deconvolutional network (RDN) as the generator and a 3D convolutional neural network (3D-CNN) as the discriminator.
The proposed model can be jointly trained by pushing the RDN to generate realistic videos so that the 3D-CNN cannot distinguish them from real ones.
- Score: 11.15855312510806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel model for video generation, with a
particular focus on the problem of video generation from text descriptions,
i.e., synthesizing realistic videos conditioned on given texts.
Existing video generation methods cannot be easily adapted to handle this task
well, due to the frame discontinuity issue and their text-free generation
schemes. To address these problems, we propose a recurrent deconvolutional
generative adversarial network (RD-GAN), which includes a recurrent
deconvolutional network (RDN) as the generator and a 3D convolutional neural
network (3D-CNN) as the discriminator. The RDN is a deconvolutional version of
the conventional recurrent neural network, which can effectively model the
long-range temporal dependency of generated video frames and make good use of conditional
information. The proposed model can be jointly trained by pushing the RDN to
generate realistic videos so that the 3D-CNN cannot distinguish them from real
ones. We apply the proposed RD-GAN to a series of tasks including conventional
video generation, conditional video generation, video prediction and video
classification, and demonstrate its effectiveness by achieving strong
performance.
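For concreteness, below is a minimal sketch of the generator/discriminator pairing the abstract describes: a recurrent state decoded into each frame by shared transposed convolutions (the RDN role) and a 3D-CNN that scores whole clips (the discriminator role). All layer sizes, the GRU-based recurrence, the placeholder text embedding, and the standard GAN loss are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; sizes, recurrence type, and losses are assumptions.
import torch
import torch.nn as nn

class RecurrentDeconvGenerator(nn.Module):
    """RDN-style generator (sketch): a recurrent state is carried across time
    steps and decoded into each frame by shared transposed convolutions."""
    def __init__(self, text_dim=128, noise_dim=100, hidden_dim=256):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Fuse the text condition and noise into a per-step input vector.
        self.fuse = nn.Linear(text_dim + noise_dim, hidden_dim * 4 * 4)
        self.rnn = nn.GRUCell(hidden_dim * 4 * 4, hidden_dim * 4 * 4)
        # Shared deconvolutional decoder: 4x4 feature map -> 32x32 RGB frame.
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, text_emb, noise, num_frames=16):
        x = self.fuse(torch.cat([text_emb, noise], dim=1))  # conditioning input
        h = torch.zeros_like(x)                              # recurrent state
        frames = []
        for _ in range(num_frames):
            h = self.rnn(x, h)                  # carries long-range temporal dependency
            feat = h.view(-1, self.hidden_dim, 4, 4)
            frames.append(self.decode(feat))    # one frame per recurrent step
        return torch.stack(frames, dim=2)       # (B, 3, T, H, W)

class VideoDiscriminator3D(nn.Module):
    """3D-CNN discriminator (sketch): scores whole clips as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(128, 1, kernel_size=(4, 8, 8)),  # single real/fake logit
        )

    def forward(self, video):
        return self.net(video).view(-1)

# One joint adversarial step with a standard GAN loss (illustrative only).
G, D = RecurrentDeconvGenerator(), VideoDiscriminator3D()
bce = nn.BCEWithLogitsLoss()
text_emb = torch.randn(2, 128)                 # placeholder text embedding
noise = torch.randn(2, 100)
real_clip = torch.randn(2, 3, 16, 32, 32)      # (B, C, T, H, W)

fake_clip = G(text_emb, noise)
d_loss = bce(D(real_clip), torch.ones(2)) + bce(D(fake_clip.detach()), torch.zeros(2))
g_loss = bce(D(fake_clip), torch.ones(2))      # push the generator to fool the 3D-CNN
```

In this sketch the single decoder applied at every recurrent step is what keeps frames temporally coherent, mirroring the abstract's motivation for recurrent, text-conditioned generation rather than frame-independent generation.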
Related papers
- Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model [55.71885688565501]
We propose a scalable generative video semantic communication framework that extracts and transmits semantic information to achieve high-quality video reconstruction.
Specifically, at the transmitter, description and other condition signals are extracted from the source video, functioning as text and structural semantics, respectively.
At the receiver, diffusion-based GenAI large models are utilized to fuse the semantics of the multiple modalities and reconstruct the video.
arXiv Detail & Related papers (2025-02-19T15:59:07Z) - World-consistent Video Diffusion with Explicit 3D Modeling [67.39618291644673]
World-consistent Video Diffusion (WVD) is a novel framework that incorporates explicit 3D supervision using XYZ images.
We train a diffusion transformer to learn the joint distribution of RGB and XYZ frames.
WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation.
arXiv Detail & Related papers (2024-12-02T18:58:23Z) - Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [35.52770785430601]
We propose a novel hybrid video diffusion model, called HVDM, which can capture intricate spatio-temporal dependencies more effectively.
The HVDM is trained with a hybrid video autoencoder which extracts a disentangled representation of the video.
Our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details.
arXiv Detail & Related papers (2024-02-21T11:46:16Z) - UniVG: Towards UNIfied-modal Video Generation [27.07637246141562]
We propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across text and image modalities.
Our method achieves the lowest Fréchet Video Distance (FVD; a sketch of the distance computation follows this list) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2.
arXiv Detail & Related papers (2024-01-17T09:46:13Z) - Conditional Generative Modeling for Images, 3D Animations, and Video [4.422441608136163]
This dissertation attempts to drive innovation in the field of generative modeling for computer vision.
The research focuses on architectures that offer transformations of noise and visual data, and on the application of encoder-decoder architectures for generative tasks and 3D content manipulation.
arXiv Detail & Related papers (2023-10-19T21:10:39Z) - NeRF-GAN Distillation for Efficient 3D-Aware Generation with
Convolutions [97.27105725738016]
The integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images.
We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations.
arXiv Detail & Related papers (2023-03-22T18:59:48Z) - Generating Videos with Dynamics-aware Implicit Generative Adversarial
Networks [68.93429034530077]
We propose a dynamics-aware implicit generative adversarial network (DIGAN) for video generation.
We show that DIGAN can be trained on 128-frame videos of 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
arXiv Detail & Related papers (2022-02-21T23:24:01Z) - Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
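The Fréchet Video Distance mentioned in the UniVG entry above is the Fréchet distance between Gaussian fits of clip-level features from real and generated videos (conventionally extracted with a pretrained I3D network). Below is a minimal sketch of the distance itself, assuming the feature arrays have already been computed elsewhere; the random stand-in features and their sizes are illustrative only.

```python
# Minimal sketch of the Frechet distance used by FVD/FID. The video feature
# extraction step (e.g. a pretrained I3D network) is assumed to happen elsewhere.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Example with random stand-in features (real usage would pass I3D features).
d = frechet_distance(np.random.randn(256, 64), np.random.randn(256, 64))
```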
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.