Related papers: AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

URL: http://arxiv.org/abs/2506.23150v1
Date: Sun, 29 Jun 2025 09:01:28 GMT
Title: AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation
Authors: Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Abstract summary: Intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC). We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment.
Score: 13.131418906572163
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.

Related papers

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction [28.19356197940266]
Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Our method can enhance the robustness of reconstruction by leveraging generative priors.
arXiv Detail & Related papers (2026-01-07T16:57:30Z)
Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image [68.55613894952177]
We introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. We propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner.
arXiv Detail & Related papers (2025-11-03T17:24:18Z)
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction [73.29048162438797]
We introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images. Experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction.
arXiv Detail & Related papers (2025-10-02T04:50:18Z)
RobustGS: Unified Boosting of Feedforward 3D Gaussian Splatting under Low-Quality Conditions [67.48495052903534]
We propose a general and efficient multi-view feature enhancement module, RobustGS. It substantially improves the robustness of feedforward 3DGS methods under various adverse imaging conditions. The RobustGS module can be seamlessly integrated into existing pretrained pipelines in a plug-and-play manner.
arXiv Detail & Related papers (2025-08-05T04:50:29Z)
DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z)
GenFusion: Closing the Loop between Reconstruction and Generation via Videos [24.195304481751602]
We propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. We also propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set.
arXiv Detail & Related papers (2025-03-27T07:16:24Z)
VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling [20.329392012132885]
We propose VideoRFSplat, a text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling.
arXiv Detail & Related papers (2025-03-20T05:26:09Z)
CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction [25.468907201804093]
Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, 2D diffusion models often struggle to produce dense images with strong multi-view consistency. We present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view.
arXiv Detail & Related papers (2025-03-11T03:08:43Z)
FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads [54.24070918942727]
We present FaceLift, a novel feed-forward approach for high-quality 360-degree 3D head reconstruction from a single image. Our pipeline first employs a multi-view latent diffusion model to generate consistent side and back views from a single input. We show that FaceLift outperforms state-of-the-art 3D face reconstruction methods on identity preservation, detail recovery, and rendering quality.
arXiv Detail & Related papers (2024-12-23T18:59:49Z)
Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation [23.87609214530216]
Hunyuan3D 1.0 achieves an impressive balance between speed and quality. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to support both text- and image-conditioned 3D generation.
arXiv Detail & Related papers (2024-11-04T17:21:42Z)
Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention [54.66152436050373]
We propose a Multi-view Large Reconstruction Model (M-LRM) to reconstruct high-quality 3D shapes from multi-views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z)
MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View [0.0]
This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. Our model is able to generate 3D meshes surpassing baseline methods in evaluation metrics, including PSNR, SSIM and LPIPS.
arXiv Detail & Related papers (2024-05-06T22:55:53Z)
Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion [101.15628083270224]
We propose a novel multi-view conditioned diffusion model to synthesize high-fidelity novel view images. We then introduce a novel iterative-update strategy to adopt it to provide precise guidance to refine the coarse generated results. Experiments show Magic-Boost greatly enhances the coarse generated inputs, generates high-quality 3D assets with rich geometric and textural details.
arXiv Detail & Related papers (2024-04-09T16:20:03Z)
DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model [86.37536249046943]
DMV3D is a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering.
arXiv Detail & Related papers (2023-11-15T18:58:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.