MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
- URL: http://arxiv.org/abs/2510.13702v1
- Date: Wed, 15 Oct 2025 16:00:26 GMT
- Title: MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
- Authors: Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh
- Abstract summary: We introduce a novel task, multi-view customization, which aims to jointly achieve multi-view pose control and customization. We propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
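As a rough illustration of the inference-stage ideas, the sketch below shows one plausible form of depth-aware feature rendering: latent features from an already-generated source view are inverse-warped into the target view using per-pixel depth and the relative camera pose, and pixels with no valid source correspondence are flagged as holes. The function name, tensor shapes, and shared-intrinsics assumption are all illustrative; the abstract does not disclose the actual implementation.
```python
# Hypothetical sketch (not the paper's code): inverse-warp source-view
# latent features into a target view using depth and relative pose.
import torch
import torch.nn.functional as F

def render_latent_features(src_feat, tgt_depth, K, tgt_to_src):
    """src_feat:   (C, H, W) latent features in the source view
    tgt_depth:  (H, W)    per-pixel depth predicted for the target view
    K:          (3, 3)    pinhole intrinsics, assumed shared by both views
    tgt_to_src: (4, 4)    rigid transform from target to source camera
    Returns warped features plus a visibility mask; pixels without a
    valid source correspondence are left for latent completion to fill.
    """
    C, H, W = src_feat.shape
    # Unproject every target pixel to a 3D point using its depth.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1).float()
    pts = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)    # (3, H*W)
    # Move the points into the source camera and project them.
    pts_h = torch.cat([pts, torch.ones(1, H * W)], dim=0)         # (4, H*W)
    pts_src = (tgt_to_src @ pts_h)[:3]
    z = pts_src[2].clamp(min=1e-6)
    uv = (K @ (pts_src / z))[:2]                                  # (2, H*W)
    # Bilinearly sample source features at the projected locations.
    u = uv[0] / (W - 1) * 2 - 1                                   # to [-1, 1]
    v = uv[1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(src_feat[None], grid, align_corners=True)[0]
    # Out-of-bounds or behind-camera pixels are holes to be completed.
    visible = (grid.abs() <= 1).all(dim=-1)[0] & (pts_src[2] > 0).reshape(H, W)
    return warped, visible
```
The holes marked by the returned mask are presumably what consistent-aware latent completion would then fill, keeping the customized subject and the surrounding background perspective-aligned across views.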
Related papers
- Unified Personalized Understanding, Generating and Editing
We present OmniPersona, an end-to-end personalization framework for unified LMMs. It integrates personalized understanding, generation, and image editing within a single architecture. Experiments demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks.
arXiv Detail & Related papers (2026-01-11T15:46:34Z)
- Towards Generalized Multi-Image Editing for Unified Multimodal Models
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation. However, UMMs struggle to maintain visual consistency and to disambiguate visual cues when referencing details across multiple input images. We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
arXiv Detail & Related papers (2026-01-09T06:42:49Z)
- UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
We propose UniLayDiff, a Unified Diffusion Transformer for content-aware layout generation tasks. We employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Experiments demonstrate that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to various conditional generation settings.
arXiv Detail & Related papers (2025-12-09T18:38:44Z)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z)
- ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
We introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation. We show that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
arXiv Detail & Related papers (2025-10-13T04:21:19Z)
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
We present UMO, a framework designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem. We develop a scalable customization dataset with multi-reference images, consisting of both synthesized and real parts.
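Read literally, the "multi-to-multi matching" idea casts identity preservation as a global assignment between subjects detected in the output and the reference identities. Below is a minimal sketch under that reading; the embeddings, the Hungarian solver, and the reward definition are assumptions, not UMO's published implementation.
```python
# Hypothetical matching reward: optimally assign generated subjects to
# reference identities and score the mean matched cosine similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_reward(gen_embs: np.ndarray, ref_embs: np.ndarray) -> float:
    """gen_embs: (M, D) embeddings of subjects detected in the generated image.
    ref_embs: (N, D) embeddings of the reference identities.
    Returns the mean cosine similarity under the optimal one-to-one
    assignment; higher means better identity preservation, less confusion.
    """
    gen = gen_embs / (np.linalg.norm(gen_embs, axis=1, keepdims=True) + 1e-8)
    ref = ref_embs / (np.linalg.norm(ref_embs, axis=1, keepdims=True) + 1e-8)
    sim = gen @ ref.T                         # (M, N) cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian: maximize total sim
    return float(sim[rows, cols].mean())
```
A score like this could serve as the matching reward that penalizes identity confusion between subjects during training.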
arXiv Detail & Related papers (2025-09-08T15:54:55Z)
- CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
CreatiDesign is a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition-driven architecture that enables flexible and precise integration of heterogeneous design elements. Furthermore, to ensure that each condition precisely controls its designated image region, we propose a multimodal attention mask mechanism.
arXiv Detail & Related papers (2025-05-25T12:14:23Z)
- ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Multi-Concept Video Customization (MCVC) remains a significant challenge. We introduce ConceptMaster, a novel framework that effectively addresses identity decoupling issues. We show that ConceptMaster significantly outperforms previous methods on video customization tasks.
arXiv Detail & Related papers (2025-01-08T18:59:01Z)
- MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation
We propose MC$^2$, a novel approach for multi-concept customization. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts. Experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment.
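As a rough illustration of refining attention weights between visual and textual tokens, one simple scheme biases each image region's cross-attention logits toward its own concept's tokens before the softmax. The region masks and the additive-bias rule here are assumptions for the sketch, not MC$^2$'s exact method.
```python
# Illustrative only: concentrate each region's attention mass on the
# text tokens of the concept assigned to that region.
import torch

def refine_attention(logits, region_masks, concept_token_ids, bias=4.0):
    """logits: (P, T) cross-attention logits (P image patches, T text tokens).
    region_masks: one (P,) bool mask per concept, marking its image region.
    concept_token_ids: one LongTensor of text-token indices per concept.
    """
    refined = logits.clone()
    for mask, tok_ids in zip(region_masks, concept_token_ids):
        patches = mask.nonzero(as_tuple=True)[0]      # patch indices in region
        refined[patches[:, None], tok_ids] += bias    # boost matching tokens
    return refined.softmax(dim=-1)                    # renormalized weights
```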
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalization needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)