Conditional Generative Modeling for Images, 3D Animations, and Video
- URL: http://arxiv.org/abs/2310.13157v1
- Date: Thu, 19 Oct 2023 21:10:39 GMT
- Title: Conditional Generative Modeling for Images, 3D Animations, and Video
- Authors: Vikram Voleti
- Abstract summary: This dissertation attempts to drive innovation in the field of generative modeling for computer vision.
The research focuses on architectures that offer reversible transformations of noise and visual data, and on encoder-decoder architectures for generative tasks and 3D content manipulation.
- Score: 4.422441608136163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This dissertation attempts to drive innovation in the field of generative
modeling for computer vision, by exploring novel formulations of conditional
generative models, and innovative applications in images, 3D animations, and
video. Our research focuses on architectures that offer reversible
transformations of noise and visual data, and the application of
encoder-decoder architectures for generative tasks and 3D content manipulation.
In all instances, we incorporate conditional information to enhance the
synthesis of visual data, improving the efficiency of the generation process as
well as the generated content.
We introduce the use of Neural ODEs to model video dynamics using an
encoder-decoder architecture, demonstrating their ability to predict future
video frames despite being trained solely to reconstruct current frames. Next,
we propose a conditional variant of continuous normalizing flows that enables
higher-resolution image generation based on lower-resolution input, achieving
comparable image quality while reducing parameters and training time. Our next
contribution presents a pipeline that takes human images as input,
automatically aligns a user-specified 3D character with the pose of the human,
and facilitates pose editing based on partial inputs. Next, we derive the
relevant mathematical details for denoising diffusion models that use
non-isotropic Gaussian processes, and show comparable generation quality.
Finally, we devise a novel denoising diffusion framework capable of solving all
three video tasks of prediction, generation, and interpolation. We perform
ablation studies, and show SOTA results on multiple datasets.
Our contributions are published articles at peer-reviewed venues. Overall,
our research aims to make a meaningful contribution to the pursuit of more
efficient and flexible generative models, with the potential to shape the
future of computer vision.
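The abstract's first contribution (a latent Neural ODE for video dynamics) can be illustrated with a minimal sketch. The assumption here is a simple encoder-decoder around a learned velocity field dz/dt = f(z, t), integrated with fixed-step Euler; all class names, layer sizes, and the solver choice are illustrative and are not the dissertation's implementation.

```python
# Hedged sketch: encode one frame into a latent state, integrate a learned ODE
# forward in time, and decode a frame at each requested time point.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame):          # frame: (B, 3, H, W)
        return self.net(frame)         # initial latent state z_0: (B, latent_dim)

class LatentODEFunc(nn.Module):
    """Parameterizes the latent dynamics dz/dt = f(z, t)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 128), nn.Tanh(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z, t):
        t_col = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_col], dim=-1))

class FrameDecoder(nn.Module):
    def __init__(self, latent_dim=64, size=32):
        super().__init__()
        self.size = size
        self.net = nn.Linear(latent_dim, 3 * size * size)

    def forward(self, z):
        return self.net(z).view(-1, 3, self.size, self.size)

def rollout(encoder, ode_func, decoder, first_frame, times, n_steps=20):
    """Encode one frame, integrate the latent ODE to each requested time
    (fixed-step Euler), and decode a frame per time point."""
    z = encoder(first_frame)
    frames, t_prev = [], torch.tensor(0.0)
    for t_target in times:
        dt = (t_target - t_prev) / n_steps
        for k in range(n_steps):
            z = z + dt * ode_func(z, t_prev + k * dt)
        t_prev = t_target
        frames.append(decoder(z))
    return torch.stack(frames, dim=1)  # (B, T, 3, H, W)

if __name__ == "__main__":
    enc, f, dec = FrameEncoder(), LatentODEFunc(), FrameDecoder()
    x0 = torch.randn(2, 3, 32, 32)
    video = rollout(enc, f, dec, x0, times=torch.tensor([1.0, 2.0, 3.0]))
    print(video.shape)  # torch.Size([2, 3, 3, 32, 32])
```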
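The conditional continuous normalizing flow contribution can be sketched similarly: the ODE velocity field is conditioned on the low-resolution input (upsampled to the target size), and the log-density is tracked with the instantaneous change-of-variables formula, d log p/dt = -tr(df/dx), here estimated with a Hutchinson trace estimator. The network, the fixed-step reverse-time Euler integration, and all names are assumptions for illustration only.

```python
# Hedged sketch: conditional CNF log-likelihood for a high-res image given a
# low-res conditioning image (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondVelocityField(nn.Module):
    """dx/dt = f(x, t, c): a conv net over the high-res state x concatenated
    with the conditioning image c (the low-res input upsampled to x's size)."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, hidden, 3, padding=1), nn.Tanh(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x, t, c):
        t_map = t.expand(x.shape[0], 1, x.shape[2], x.shape[3])
        return self.net(torch.cat([x, c, t_map], dim=1))

def log_likelihood(field, x, lowres, n_steps=10):
    """Integrate x backwards in time to the base distribution while
    accumulating the trace term; returns an estimate of log p(x | lowres)."""
    c = F.interpolate(lowres, size=x.shape[-2:], mode="bilinear",
                      align_corners=False)
    delta_logp = torch.zeros(x.shape[0])
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor(1.0 - i * dt)
        x = x.requires_grad_(True)
        f = field(x, t, c)
        # Hutchinson estimator: tr(df/dx) ~= E[eps^T (df/dx) eps]
        eps = torch.randn_like(x)
        (vjp,) = torch.autograd.grad(f, x, grad_outputs=eps)  # eps^T (df/dx)
        trace = (vjp * eps).flatten(1).sum(dim=1)
        x = (x - dt * f).detach()          # reverse-time Euler step
        delta_logp = delta_logp + dt * trace
    # base density: standard normal on the final (t = 0) state
    logp_base = -0.5 * (x.flatten(1) ** 2).sum(dim=1) \
                - 0.5 * x[0].numel() * torch.log(torch.tensor(2 * torch.pi))
    # log p(data) = log p_base(z) - integral of tr(df/dx) dt
    return logp_base - delta_logp

if __name__ == "__main__":
    field = CondVelocityField()
    hi = torch.randn(2, 3, 32, 32)     # high-resolution sample
    lo = torch.randn(2, 3, 8, 8)       # low-resolution conditioning input
    print(log_likelihood(field, hi, lo).shape)  # torch.Size([2])
```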
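For the non-isotropic denoising diffusion contribution, the key change relative to standard DDPM-style forward noising is that the per-dimension noise covariance is no longer the identity. A minimal sketch, assuming a diagonal per-pixel covariance and a linear beta schedule (neither taken from the dissertation):

```python
# Hedged sketch: forward diffusion with non-isotropic (diagonal) Gaussian noise,
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * Sigma).
import torch

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)          # alpha_bar_t for each step

def q_sample_nonisotropic(x0, t, alpha_bar, sigma):
    """Draw x_t ~ q(x_t | x_0) with a diagonal, non-isotropic covariance.
    x0: (B, C, H, W); t: (B,) integer timesteps; sigma: (C, H, W) per-pixel std."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0) * sigma                # N(0, Sigma) noise
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps
    return x_t, eps

if __name__ == "__main__":
    alpha_bar = make_alpha_bar()
    x0 = torch.randn(4, 3, 32, 32)
    t = torch.randint(0, 1000, (4,))
    sigma = torch.ones(3, 32, 32)
    sigma[0] *= 2.0                                   # e.g. noisier first channel
    x_t, eps = q_sample_nonisotropic(x0, t, alpha_bar, sigma)
    print(x_t.shape)  # torch.Size([4, 3, 32, 32])
```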
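Finally, the unified video diffusion framework for prediction, generation, and interpolation can be conveyed by how conditioning is masked: one denoiser receives context frames plus a per-frame mask, and the three tasks differ only in which frames are revealed. The tiny 3D-conv denoiser and all names below are stand-ins for illustration, not the published model.

```python
# Hedged sketch: task-dependent masking of conditioning frames so one denoiser
# covers video prediction, unconditional generation, and interpolation.
import torch
import torch.nn as nn

def build_conditioning(video, task):
    """video: (B, T, C, H, W). Returns (masked conditioning frames, mask),
    where mask[b, t] = 1 if frame t is revealed to the model as context."""
    B, T, C, H, W = video.shape
    mask = torch.zeros(B, T, 1, 1, 1)
    if task == "prediction":            # condition on the first half (past)
        mask[:, : T // 2] = 1.0
    elif task == "interpolation":       # condition on first and last frames
        mask[:, 0] = 1.0
        mask[:, -1] = 1.0
    elif task == "generation":          # unconditional: nothing revealed
        pass
    return video * mask, mask

class TinyVideoDenoiser(nn.Module):
    """Predicts the noise in the target frames, given the noisy frames, the
    masked conditioning frames, and the mask itself (stacked on channels)."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2 * channels + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, channels, 3, padding=1),
        )

    def forward(self, noisy, cond, mask):
        # (B, T, C, H, W) -> (B, C, T, H, W) for Conv3d; concat on channel dim
        inp = torch.cat([noisy, cond, mask.expand_as(noisy[:, :, :1])], dim=2)
        out = self.net(inp.permute(0, 2, 1, 3, 4))
        return out.permute(0, 2, 1, 3, 4)

if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 16, 16)
    noisy = video + torch.randn_like(video)      # stand-in for a diffused video
    model = TinyVideoDenoiser()
    for task in ["prediction", "generation", "interpolation"]:
        cond, mask = build_conditioning(video, task)
        print(task, model(noisy, cond, mask).shape)
```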
Related papers
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
arXiv Detail & Related papers (2024-10-01T17:29:43Z) - ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [63.169364481672915]
We propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images.
Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames.
arXiv Detail & Related papers (2024-09-03T16:53:19Z) - Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z) - Video Prediction Models as General Visual Encoders [0.0]
The researchers propose using video prediction models as general visual encoders, leveraging their ability to capture critical spatial and temporal information.
Inspired by human vision studies, the approach aims to develop a latent space representative of motion from images.
Experiments involve adapting pre-trained video generative models, analyzing their latent spaces, and training custom decoders for foreground-background segmentation.
arXiv Detail & Related papers (2024-05-25T23:55:47Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that can process text input and generate high-fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model used for image generation through the systematic introduction of noise over repeated steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and, in particular, generates more photorealistic images.
arXiv Detail & Related papers (2022-06-01T10:39:12Z) - Insights from Generative Modeling for Neural Video Compression [31.59496634465347]
We present newly proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling.
We propose several architectures that yield state-of-the-art video compression performance on high-resolution video.
We provide further evidence that the generative modeling viewpoint can advance the neural video coding field.
arXiv Detail & Related papers (2021-07-28T02:19:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.