Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers
- URL: http://arxiv.org/abs/2503.13588v1
- Date: Mon, 17 Mar 2025 17:59:59 GMT
- Title: Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers
- Authors: Shiran Yuan, Hao Zhao
- Abstract summary: ArchonView is a method that significantly outperforms state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster at inference than diffusion-based models.
- Score: 4.015569252776372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods based on diffusion backbones have recently revolutionized novel view synthesis (NVS). However, those models require pretrained 2D diffusion checkpoints (e.g., Stable Diffusion) as the basis for geometrical priors. Since such checkpoints require exorbitant amounts of data and compute to train, this greatly limits the scalability of diffusion-based NVS models. We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly outperforms state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. We achieve this by incorporating both global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning into a backbone based on the next-scale autoregression paradigm. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster at inference than diffusion-based models. We experimentally verify that performance scales with model and dataset size, and extensively demonstrate our method's synthesis quality across several tasks. Our code is open-sourced at https://github.com/Shiran-Yuan/ArchonView.
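The abstract's two conditioning pathways map naturally onto a VAR-style next-scale generation loop. Below is a minimal, hypothetical PyTorch sketch of that paradigm: a pose embedding supplies the global condition, and the previous scale's feature map, upsampled, supplies the local condition. This is an illustration only, not the authors' implementation (which is at the GitHub link above); every module name, shape, and hyperparameter is an assumption.

```python
# Minimal, hypothetical sketch of view-conditioned next-scale autoregression
# (a VAR-style loop). Illustration only -- NOT the authors' implementation
# (see https://github.com/Shiran-Yuan/ArchonView); all names, shapes, and
# hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextScaleViewSynthesizer(nn.Module):
    def __init__(self, dim=512, vocab=4096, scales=(1, 2, 4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.token_emb = nn.Embedding(vocab, dim)
        # Global conditioning: a camera-pose embedding ("pose-augmented semantics").
        self.pose_mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab)

    def forward(self, src_tokens, pose):
        """src_tokens: (B, L) discrete tokens of the source view;
        pose: (B, 12) flattened target-camera extrinsics."""
        B = pose.shape[0]
        cond = self.pose_mlp(pose).unsqueeze(1)              # (B, 1, dim)
        ctx = torch.cat([cond, self.token_emb(src_tokens)], dim=1)
        prev, all_logits = None, []
        for s in self.scales:                                # coarse -> fine
            if prev is None:
                # Coarsest scale is generated from the pose embedding alone.
                query = cond.expand(B, s * s, -1)
            else:
                # Local conditioning: the previous scale's feature map,
                # upsampled to the current resolution ("multi-scale
                # hierarchical encodings"). A real VAR would quantize and
                # re-embed tokens here; continuous features keep this short.
                up = F.interpolate(prev, size=(s, s), mode="bicubic")
                query = up.flatten(2).transpose(1, 2)        # (B, s*s, dim)
            h = self.backbone(torch.cat([ctx, query], dim=1))[:, ctx.shape[1]:]
            all_logits.append(self.head(h))                  # (B, s*s, vocab)
            prev = h.transpose(1, 2).reshape(B, -1, s, s)
        return all_logits                                    # one logit map per scale
```

One forward pass per scale (five in this sketch) rather than hundreds of denoising steps is what makes a loop of this shape consistent with the abstract's claim of being several times faster than diffusion at inference.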
Related papers
- Stable Virtual Camera: Generative View Synthesis with Diffusion Models [51.71244310522393]
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene.
Our approach overcomes the limitations of prior methods through simple model design, an optimized training recipe, and a flexible sampling strategy.
Our method can generate high-quality videos lasting up to half a minute with seamless loop closure.
arXiv Detail & Related papers (2025-03-18T17:57:22Z)
- SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images [49.7344030427291]
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. We present SPAR3D, a novel two-stage approach aiming to take the best of both directions.
arXiv Detail & Related papers (2025-01-08T18:52:03Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. We empirically find that this training paradigm limits the one-step generation performance of consistency models. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution (the standard setup is sketched after this list).
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior [53.52396082006044]
Current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints.
This issue stems from the sparse training views captured by a fixed camera on a moving vehicle.
We propose a novel approach that enhances the capacity of 3DGS by leveraging a prior from a diffusion model.
arXiv Detail & Related papers (2024-03-29T09:20:29Z)
- DiffComplete: Diffusion-based Generative 3D Shape Completion [114.43353365917015]
We introduce a new diffusion-based approach for shape completion on 3D range scans.
We strike a balance between realism, multi-modality, and high fidelity.
DiffComplete sets a new SOTA on two large-scale 3D shape completion benchmarks.
arXiv Detail & Related papers (2023-06-28T16:07:36Z)
- HoloDiffusion: Training a 3D Diffusion Model using 2D Images [71.1144397510333]
We introduce a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision.
We show that our diffusion models are scalable, train robustly, and are competitive with existing approaches to 3D generative modeling in terms of sample quality and fidelity.
arXiv Detail & Related papers (2023-03-29T07:35:56Z)
- Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, i.e. masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them.
This is the first use of sparse convolution for 2D masked modeling.
arXiv Detail & Related papers (2023-01-09T18:59:50Z)
- Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation [28.25023686484727]
A diffusion model learns to predict a vector field of gradients.
We propose a chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer.
We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.
arXiv Detail & Related papers (2022-12-01T18:56:37Z)
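The Score Jacobian Chaining entry that closes the list above reduces to a single chain-rule identity. As a hedged sketch, with notation assumed here rather than taken from the paper: let x = g(θ, π) be a differentiable render of 3D parameters θ from camera π, and let the pretrained diffusion model approximate the 2D score; the score is lifted to a 3D gradient through the renderer's Jacobian:

```latex
% Hedged sketch; all symbols assumed. x = g(\theta, \pi) is a differentiable
% render of 3D parameters \theta from camera \pi, and \nabla_x \log p(x) is
% the 2D score approximated by a pretrained diffusion model.
\nabla_\theta \log \tilde{p}(\theta)
  = \mathbb{E}_{\pi}\!\left[
      \left( \frac{\partial g(\theta, \pi)}{\partial \theta} \right)^{\!\top}
      \nabla_x \log p\big( g(\theta, \pi) \big)
    \right]
```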
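For the Truncated Consistency Models entry above, here is the promised sketch of the standard consistency-model setup its summary builds on; the notation is assumed, and the paper's own new parameterization is not reproduced here:

```latex
% Hedged sketch of the standard consistency-model setup; notation assumed.
% f maps any point x_t on a probability-flow (PF) ODE trajectory,
% t \in [\epsilon, T], to the trajectory's endpoint:
f_\theta(x_t, t) \approx x_\epsilon \quad \forall\, t \in [\epsilon, T],
\qquad f_\theta(x_\epsilon, \epsilon) = x_\epsilon \;\;\text{(boundary condition)},
% commonly enforced through the parameterization
f_\theta(x, t) = c_{\mathrm{skip}}(t)\, x + c_{\mathrm{out}}(t)\, F_\theta(x, t),
\qquad c_{\mathrm{skip}}(\epsilon) = 1,\; c_{\mathrm{out}}(\epsilon) = 0.
% The truncated variant, as summarized above, trains in two stages so that
% the second stage enforces consistency only on a truncated range t \in [t', T].
```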
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.