ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models
- URL: http://arxiv.org/abs/2512.14099v1
- Date: Tue, 16 Dec 2025 05:15:07 GMT
- Title: ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models
- Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
- Abstract summary: We introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints. Our approach ranks first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS.
- Score: 70.28556518166037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-view image generation from a single image and a text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking conditioned on text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the need for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
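The generation loop the abstract describes (all target-view tokens start masked and are revealed progressively by confidence-ranked prediction) closely matches MaskGIT-style iterative unmasking. The sketch below illustrates that sampling pattern under explicit assumptions: `model`, `MASK_ID`, the three-view sequence layout, and the cosine schedule are hypothetical stand-ins rather than the paper's actual interface, and greedy prediction is used where the authors may sample.

```python
# Minimal sketch of iterative token unmasking for multi-view generation.
# All names and hyperparameters here are illustrative assumptions.
import math
import torch

MASK_ID = 8192          # assumed id of the special [MASK] token
STEPS = 12              # assumed number of unmasking iterations
SEQ_LEN = 3 * 16 * 16   # e.g. three 16x16 token grids, one per target view

@torch.no_grad()
def generate_views(model, text_tokens, src_tokens, device="cpu"):
    """Progressively unmask the visual tokens of all target views.

    `model(text_tokens, src_tokens, view_tokens)` is assumed to return
    logits over the visual codebook for every view-token position.
    """
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long, device=device)
    unmasked = 0
    for step in range(STEPS):
        logits = model(text_tokens, src_tokens, tokens)      # (1, L, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)      # greedy picks
        conf = conf.masked_fill(tokens != MASK_ID, float("-inf"))
        # Cosine schedule: how many tokens should be revealed by this step.
        target = SEQ_LEN - int(SEQ_LEN * math.cos(math.pi / 2 * (step + 1) / STEPS))
        n_new = max(target - unmasked, 1)
        idx = conf.topk(n_new, dim=-1).indices               # most confident masked slots
        tokens.scatter_(1, idx, pred.gather(1, idx))
        unmasked += n_new
    return tokens  # decode to pixels with the visual tokenizer (e.g. MAGVIT-v2)
```

In this reading, cross-view consistency falls out of the fact that the self-attention inside `model` sees the tokens of every view at every step, which is why no epipolar or other geometric machinery is required.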
Related papers
- FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation [49.74776147964999]
We present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output.
arXiv Detail & Related papers (2025-12-10T13:06:40Z)
- Seg-VAR: Image Segmentation with Visual Autoregressive Modeling [60.79579744943664]
We propose a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing discriminative learning with a latent learning process. Our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of the segmentation mask) encoder that maps segmentation masks into discrete latent tokens, and (3) a decoder reconstructing masks from these latents (a minimal sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2025-11-16T13:36:19Z)
- MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention [83.56588173102594]
We introduce a solution called mesh attention to enable training at 1024x1024 resolution. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT.
arXiv Detail & Related papers (2025-03-11T17:50:59Z)
- CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation [59.257513664564996]
We introduce a novel method for generating 360° panoramas from text prompts or images. We employ multi-view diffusion models to jointly synthesize the six faces of a cubemap. Our model allows for fine-grained text control, generates high-resolution panorama images, and generalizes well beyond its training set.
arXiv Detail & Related papers (2025-01-28T18:59:49Z)
- MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View [0.0]
This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model.
Our model is able to generate 3D meshes surpassing baseline methods in evaluation metrics, including PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2024-05-06T22:55:53Z)
- Envision3D: One Image to 3D with Anchor Views Interpolation [18.31796952040799]
We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image.
It is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.
arXiv Detail & Related papers (2024-03-13T18:46:33Z)
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion [60.30030562932703]
EpiDiff is a localized interactive multiview diffusion model.
It generates 16 multiview images in just 12 seconds.
It surpasses previous methods in quality evaluation metrics.
arXiv Detail & Related papers (2023-12-11T05:20:52Z)
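The Seg-VAR entry above enumerates a three-part pipeline: an image encoder producing latent priors, a seglat encoder quantizing masks into discrete tokens, and a decoder reconstructing masks. Below is a minimal sketch of that layout, assuming a plain VQ codebook and convolutional modules; every shape and module choice here is an illustrative guess, not the authors' implementation.

```python
# Illustrative three-part Seg-VAR-style pipeline: image encoder,
# seglat (mask-to-token) encoder with a VQ codebook, and mask decoder.
import torch
import torch.nn as nn

class SegVARSketch(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        # (1) image encoder: latent priors from the input image
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        # (2) seglat encoder: maps a segmentation mask to continuous latents
        self.seglat_encoder = nn.Sequential(
            nn.Conv2d(1, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)  # discrete seglat vocabulary
        # (3) decoder: reconstructs the mask from quantized latents
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(dim, 1, 4, stride=4),
        )

    def quantize(self, z):
        # nearest-codebook-entry lookup (plain VQ, assumed)
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)
        tokens = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        zq = self.codebook(tokens).view(b, h, w, d).permute(0, 3, 1, 2)
        return zq, tokens.view(b, h, w)  # quantized latents + discrete tokens

    def forward(self, image, mask):
        prior = self.image_encoder(image)                  # conditioning latents
        zq, tokens = self.quantize(self.seglat_encoder(mask))
        return self.decoder(zq + prior), tokens            # reconstructed mask logits

# e.g. SegVARSketch()(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
```

The autoregressive generation step (predicting seglat tokens conditioned on the image prior) would sit between the encoder and decoder; it is omitted here for brevity.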