MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
- URL: http://arxiv.org/abs/2510.13702v1
- Date: Wed, 15 Oct 2025 16:00:26 GMT
- Title: MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
- Authors: Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh
- Abstract summary: We introduce a novel task, multi-view customization, which aims to jointly achieve multi-view pose control and customization. We propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
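As a rough illustration of the inference-stage ideas, the sketch below shows one plausible form of depth-aware feature rendering: latent features from an already-generated source view are inverse-warped into the target view using per-pixel depth and the relative camera pose, and pixels with no valid source correspondence are flagged as holes. The function name, tensor shapes, and shared-intrinsics assumption are all illustrative; the abstract does not disclose the actual implementation.
```python
# Hypothetical sketch (not the paper's code): inverse-warp source-view
# latent features into a target view using depth and relative pose.
import torch
import torch.nn.functional as F

def render_latent_features(src_feat, tgt_depth, K, tgt_to_src):
    """src_feat:   (C, H, W) latent features in the source view
    tgt_depth:  (H, W)    per-pixel depth predicted for the target view
    K:          (3, 3)    pinhole intrinsics, assumed shared by both views
    tgt_to_src: (4, 4)    rigid transform from target to source camera
    Returns warped features plus a visibility mask; pixels without a
    valid source correspondence are left for latent completion to fill.
    """
    C, H, W = src_feat.shape
    # Unproject every target pixel to a 3D point using its depth.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1).float()
    pts = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)    # (3, H*W)
    # Move the points into the source camera and project them.
    pts_h = torch.cat([pts, torch.ones(1, H * W)], dim=0)         # (4, H*W)
    pts_src = (tgt_to_src @ pts_h)[:3]
    z = pts_src[2].clamp(min=1e-6)
    uv = (K @ (pts_src / z))[:2]                                  # (2, H*W)
    # Bilinearly sample source features at the projected locations.
    u = uv[0] / (W - 1) * 2 - 1                                   # to [-1, 1]
    v = uv[1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(src_feat[None], grid, align_corners=True)[0]
    # Out-of-bounds or behind-camera pixels are holes to be completed.
    visible = (grid.abs() <= 1).all(dim=-1)[0] & (pts_src[2] > 0).reshape(H, W)
    return warped, visible
```
The holes marked by the returned mask are presumably what consistent-aware latent completion would then fill, keeping the customized subject and the surrounding background perspective-aligned across views.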
Related papers
- Unified Personalized Understanding, Generating and Editing
We present OmniPersona, an end-to-end personalization framework for unified LMMs. It integrates personalized understanding, generation, and image editing within a single architecture. Experiments demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks.
arXiv Detail & Related papers (2026-01-11T15:46:34Z)
- Towards Generalized Multi-Image Editing for Unified Multimodal Models
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation. However, UMMs struggle to maintain visual consistency and to disambiguate visual cues when referencing details across multiple input images. We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
arXiv Detail & Related papers (2026-01-09T06:42:49Z)
- UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
We propose UniLayDiff, a Unified Diffusion Transformer for content-aware layout generation tasks. We employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Experiments demonstrate that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to various conditional generation settings.
arXiv Detail & Related papers (2025-12-09T18:38:44Z)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z)
- ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
We introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation. We show that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
arXiv Detail & Related papers (2025-10-13T04:21:19Z)
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
We present UMO, a framework designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem. We develop a scalable customization dataset with multi-reference images, consisting of both synthesized and real parts.
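Read literally, the "multi-to-multi matching" idea casts identity preservation as a global assignment between subjects detected in the output and the reference identities. Below is a minimal sketch under that reading; the embeddings, the Hungarian solver, and the reward definition are assumptions, not UMO's published implementation.
```python
# Hypothetical matching reward: optimally assign generated subjects to
# reference identities and score the mean matched cosine similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_reward(gen_embs: np.ndarray, ref_embs: np.ndarray) -> float:
    """gen_embs: (M, D) embeddings of subjects detected in the generated image.
    ref_embs: (N, D) embeddings of the reference identities.
    Returns the mean cosine similarity under the optimal one-to-one
    assignment; higher means better identity preservation, less confusion.
    """
    gen = gen_embs / (np.linalg.norm(gen_embs, axis=1, keepdims=True) + 1e-8)
    ref = ref_embs / (np.linalg.norm(ref_embs, axis=1, keepdims=True) + 1e-8)
    sim = gen @ ref.T                         # (M, N) cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian: maximize total sim
    return float(sim[rows, cols].mean())
```
A score like this could serve as the matching reward that penalizes identity confusion between subjects during training.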
arXiv Detail & Related papers (2025-09-08T15:54:55Z)
- CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
CreatiDesign is a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition-driven architecture that enables flexible and precise integration of heterogeneous design elements. Furthermore, to ensure that each condition precisely controls its designated image region, we propose a multimodal attention mask mechanism.
arXiv Detail & Related papers (2025-05-25T12:14:23Z)
- ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Multi-Concept Video Customization (MCVC) remains a significant challenge. We introduce ConceptMaster, a novel framework that effectively addresses identity decoupling issues. We show that ConceptMaster significantly outperforms previous methods on video customization tasks.
arXiv Detail & Related papers (2025-01-08T18:59:01Z)
- MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation
We propose MC$^2$, a novel approach for multi-concept customization. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts. Experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment.
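As a rough illustration of refining attention weights between visual and textual tokens, one simple scheme biases each image region's cross-attention logits toward its own concept's tokens before the softmax. The region masks and the additive-bias rule here are assumptions for the sketch, not MC$^2$'s exact method.
```python
# Illustrative only: concentrate each region's attention mass on the
# text tokens of the concept assigned to that region.
import torch

def refine_attention(logits, region_masks, concept_token_ids, bias=4.0):
    """logits: (P, T) cross-attention logits (P image patches, T text tokens).
    region_masks: one (P,) bool mask per concept, marking its image region.
    concept_token_ids: one LongTensor of text-token indices per concept.
    """
    refined = logits.clone()
    for mask, tok_ids in zip(region_masks, concept_token_ids):
        patches = mask.nonzero(as_tuple=True)[0]      # patch indices in region
        refined[patches[:, None], tok_ids] += bias    # boost matching tokens
    return refined.softmax(dim=-1)                    # renormalized weights
```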
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalization needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)