RegionRoute: Regional Style Transfer with Diffusion Model
- URL: http://arxiv.org/abs/2602.19254v1
- Date: Sun, 22 Feb 2026 16:11:07 GMT
- Title: RegionRoute: Regional Style Transfer with Diffusion Model
- Authors: Bowen Chen, Jake Zuena, Alan C. Bovik, Divya Kothandaraman
- Abstract summary: We propose an attention-supervised diffusion framework that teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. Experiments show that our method achieves mask-free, single-object style transfer at inference.
- Score: 31.189878461660115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
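The two attention objectives named in the abstract can be illustrated with a minimal sketch. The abstract only states that the Focus loss is based on KL divergence and the Cover loss on binary cross-entropy. Everything else here is an assumption made for illustration and not the authors' formulation: the KL direction, the normalization of the attention map, the uniform target distribution inside the mask, the function names `focus_loss` and `cover_loss`, and the weights `lambda_focus` and `lambda_cover`.

```python
import math

EPS = 1e-8  # numerical floor to avoid log(0); an assumption of this sketch


def focus_loss(attn, mask):
    """KL(mask distribution || attention distribution) over spatial positions.

    attn: non-negative attention scores of one style token, flattened.
    mask: binary object mask (1 inside the target object), flattened.
    The KL direction and the uniform-in-mask target are assumptions.
    """
    attn_sum = sum(attn) + EPS
    mask_sum = sum(mask) + EPS
    loss = 0.0
    for a, m in zip(attn, mask):
        p = m / mask_sum   # target: uniform probability inside the mask
        q = a / attn_sum   # attention normalized to sum to (about) one
        if p > 0:          # terms with p == 0 contribute nothing to KL
            loss += p * math.log(p / (q + EPS))
    return loss


def cover_loss(attn, mask):
    """Mean binary cross-entropy between per-position attention and the mask.

    attn values are assumed to already lie in [0, 1].
    """
    total = 0.0
    for a, m in zip(attn, mask):
        a = min(max(a, EPS), 1.0 - EPS)  # clamp so both logs stay finite
        total += -(m * math.log(a) + (1 - m) * math.log(1 - a))
    return total / len(attn)


# Hypothetical 2x2 attention map, flattened; the object occupies the last two cells.
mask = [0, 0, 1, 1]
on_object = [0.05, 0.05, 0.9, 0.9]   # attention concentrated on the object
off_object = [0.9, 0.9, 0.05, 0.05]  # attention concentrated on the background

lambda_focus, lambda_cover = 1.0, 1.0  # hypothetical weights
for name, attn in [("on_object", on_object), ("off_object", off_object)]:
    total = lambda_focus * focus_loss(attn, mask) + lambda_cover * cover_loss(attn, mask)
    print(name, round(total, 4))
```

In this sketch the KL term rewards attention mass that falls inside the mask (localization), while the cross-entropy term penalizes each position individually, so attention that covers only part of the object is also discouraged (dense coverage).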
Related papers
- CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer [85.217605146499]
CoCoDiff is a training-free and low-cost style transfer framework for computer vision. It exploits pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
arXiv Detail & Related papers (2026-02-16T04:52:29Z)
- Style Composition within Distinct LoRA modules for Traditional Art [21.954368353156546]
We propose a zero-shot diffusion pipeline that naturally blends multiple styles. We leverage the fact that lower-noise latents carry stronger stylistic information. We incorporate depth-map conditioning via ControlNet into the diffusion framework.
arXiv Detail & Related papers (2025-07-16T07:36:07Z)
- Unsupervised Region-Based Image Editing of Denoising Diffusion Models [50.005612464340246]
We propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. Our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations.
arXiv Detail & Related papers (2024-12-17T13:46:12Z)
- Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis [18.755311950243737]
The latent space of Diffusion Models (DMs) is not as well understood as that of Generative Adversarial Networks (GANs).
Recent research has focused on unsupervised semantic discovery in the latent space of DMs.
We introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs.
arXiv Detail & Related papers (2024-08-29T18:21:50Z)
- LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model [19.37714374680383]
LocalStyleFool is an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos.
We demonstrate that LocalStyleFool can improve both intra-frame and inter-frame naturalness through a human-assessed survey.
arXiv Detail & Related papers (2024-03-18T10:53:00Z)
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [69.33072075580483]
This paper introduces LIME for localized image editing in diffusion models. LIME does not require user-specified regions of interest (RoI) or additional text input, but rather employs features from pre-trained methods and a straightforward clustering method to obtain precise editing mask. We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z)
- SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization [67.90315365909244]
We propose a novel Spatial Alignment and Region-Adaptive normalization method (SARA) in this paper.
Our method generates detailed makeup transfer results that can handle large spatial misalignments and achieve part-specific and shade-controllable makeup transfer.
Experimental results show that our SARA method outperforms existing methods and achieves state-of-the-art performance on two public datasets.
arXiv Detail & Related papers (2023-11-28T14:46:51Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- MODIFY: Model-driven Face Stylization without Style Images [77.24793103549158]
Existing face stylization methods always require the presence of the target (style) domain during the translation process.
We propose a new method called MODel-drIven Face stYlization (MODIFY), which relies on the generative model to bypass the dependence of the target images.
Experimental results on several different datasets validate the effectiveness of MODIFY for unsupervised face stylization.
arXiv Detail & Related papers (2023-03-17T08:35:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.