JOG3R: Towards 3D-Consistent Video Generators
- URL: http://arxiv.org/abs/2501.01409v2
- Date: Wed, 26 Mar 2025 20:53:45 GMT
- Title: JOG3R: Towards 3D-Consistent Video Generators
- Authors: Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan
- Abstract summary: We investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator can support camera pose estimation. We propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
- Score: 12.967198315639106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether the intermediate features of a video generator - OpenSora in our case - can support camera pose estimation. Surprisingly, at first we find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. We therefore propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that state-of-the-art video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates of competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
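The abstract describes joint training with a photometric (generation) error and a 3D-aware (camera pose) error on top of a shared backbone. The sketch below is only an illustration of that idea under stated assumptions, not the authors' code: the module names, the tiny per-frame encoder standing in for the video backbone, the 7-DoF pose parameterization, and the `pose_weight` value are all hypothetical.

```python
# Minimal sketch (assumed, not JOG3R's actual architecture): one backbone
# feeding both a denoising head (photometric path) and a pose head
# (3D-aware path), trained with a combined loss.
import torch
import torch.nn as nn


class Jog3rSketch(nn.Module):
    def __init__(self, feat_dim=256, frame_dim=3 * 32 * 32):
        super().__init__()
        # Stand-in for the shared video-generation backbone; here just a
        # small per-frame encoder over flattened RGB frames.
        self.backbone = nn.Sequential(
            nn.Linear(frame_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Denoising head: predicts the noise added to each frame.
        self.denoise_head = nn.Linear(feat_dim, frame_dim)
        # 3D-aware head: regresses a per-frame camera pose
        # (translation + quaternion), standing in for the DUSt3R-style branch.
        self.pose_head = nn.Linear(feat_dim, 7)

    def forward(self, noisy_frames):
        # noisy_frames: (batch, frames, frame_dim)
        feats = self.backbone(noisy_frames)
        return self.denoise_head(feats), self.pose_head(feats)


def joint_loss(model, noisy_frames, true_noise, true_poses, pose_weight=0.1):
    """Photometric (denoising) error plus a weighted 3D-aware pose error."""
    pred_noise, pred_poses = model(noisy_frames)
    photometric = nn.functional.mse_loss(pred_noise, true_noise)
    pose = nn.functional.mse_loss(pred_poses, true_poses)
    return photometric + pose_weight * pose


if __name__ == "__main__":
    model = Jog3rSketch()
    frames = torch.randn(2, 8, 3 * 32 * 32)  # 2 clips, 8 frames each
    noise = torch.randn_like(frames)
    poses = torch.randn(2, 8, 7)
    loss = joint_loss(model, frames, noise, poses)
    loss.backward()
    print(float(loss))
```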
Related papers
- VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step [13.168559963356952]
VideoScene aims to distill the video diffusion model to generate 3D scenes in one step.
VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models.
arXiv Detail & Related papers (2025-04-02T17:59:21Z) - LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors [107.83398512719981]
Single-image 3D reconstruction remains a fundamental challenge in computer vision. Recent advances in Latent Video Diffusion Models offer promising 3D priors learned from large-scale video data. We propose LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency.
arXiv Detail & Related papers (2024-12-12T18:58:42Z) - AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z) - Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z) - SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images.
SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views.
It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z) - Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z) - Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion [3.545941891218148]
We investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or if it is sufficient for a model to generate 3D representations of each timestep independently.
We propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video.
arXiv Detail & Related papers (2024-06-17T04:09:04Z) - V3D: Video Diffusion Models are Effective 3D Generators [19.33837029942662]
We introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation.
Benefiting from this, the state-of-the-art video diffusion model can be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image.
Our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views.
arXiv Detail & Related papers (2024-03-11T14:03:36Z) - IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by considering video instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z) - Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input [23.505846631252993]
In this work, we focus on the robustness of 2D-to-3D pose lifters.
We observe the poor generalization of state-of-the-art 3D pose lifters in the presence of corruption.
We introduce Temporal Additive Gaussian Noise (TAGN) as a simple yet effective 2D input pose data augmentation.
To incorporate the confidence scores output by the 2D pose detectors, we design a confidence-aware convolution (CA-Conv) block.
arXiv Detail & Related papers (2023-12-11T19:13:38Z) - Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild [10.849750765175754]
POTR-3D is a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE.
It robustly generalizes to diverse unseen views, robustly recovers the poses against heavy occlusions, and reliably generates more natural and smoother outputs.
arXiv Detail & Related papers (2023-09-15T06:17:22Z) - PV3D: A 3D Generative Model for Portrait Video Generation [94.96025739097922]
We propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos.
PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing.
arXiv Detail & Related papers (2022-12-13T05:42:44Z) - 3D-Aware Video Generation [149.5230191060692]
We explore 4D generative adversarial networks (GANs) that learn generation of 3D-aware videos.
By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos.
arXiv Detail & Related papers (2022-06-29T17:56:03Z) - Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentan- gled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z) - Unsupervised object-centric video generation and decomposition in 3D [36.08064849807464]
We propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background.
Our model is trained from monocular videos without any supervision, yet learns to generate coherent 3D scenes containing several moving objects.
arXiv Detail & Related papers (2020-07-07T18:01:29Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)