ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities
- URL: http://arxiv.org/abs/2504.06895v1
- Date: Wed, 09 Apr 2025 13:55:32 GMT
- Title: ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities
- Authors: Dingkun Yan, Xinrui Wang, Yusuke Iwasawa, Yutaka Matsuo, Suguru Saito, Jiaxian Guo
- Abstract summary: Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. Most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, resulting in spatial artifacts and significant degradation in overall colorization quality.
- Score: 28.160601838418433
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the **carrier**, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.
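The split cross-attention idea from the abstract can be sketched in a few lines of NumPy: query positions flagged as foreground attend only to foreground reference tokens, and background positions only to background tokens, so misaligned reference regions cannot bleed across the mask. This is a minimal illustrative sketch under assumed shapes and names, not the paper's implementation (which operates inside a diffusion U-Net with learned projections).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_cross_attention(q, k_fg, v_fg, k_bg, v_bg, fg_mask):
    """Region-specific reference injection (illustrative sketch).

    q:       (Nq, d)  queries, one per spatial position of the sketch latent
    k_fg/v_fg: (Nf, d) keys/values from foreground reference tokens
    k_bg/v_bg: (Nb, d) keys/values from background reference tokens
    fg_mask: (Nq,)    boolean spatial mask, True = foreground position
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention, computed per region.
    attn_fg = softmax(q @ k_fg.T / np.sqrt(d)) @ v_fg  # (Nq, d)
    attn_bg = softmax(q @ k_bg.T / np.sqrt(d)) @ v_bg  # (Nq, d)
    # The spatial mask routes each query to exactly one reference stream.
    out = np.empty_like(q)
    out[fg_mask] = attn_fg[fg_mask]
    out[~fg_mask] = attn_bg[~fg_mask]
    return out
```

The design choice the sketch highlights: instead of one attention over all reference tokens (where a misaligned background token can dominate a foreground pixel), the mask partitions queries so each region draws color information only from its matching reference region.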
Related papers
- Image Referenced Sketch Colorization Based on Animation Creation Workflow [28.281739343084993]
We propose a diffusion-based framework inspired by real-world animation production. Our approach leverages the sketch as the spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with masks. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating spatial artifacts.
arXiv Detail & Related papers (2025-02-27T10:04:47Z)
- MangaNinja: Line Art Colorization with Precise Reference Following [84.2001766692797]
MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription: a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching.
arXiv Detail & Related papers (2025-01-14T18:59:55Z)
- Unsupervised Region-Based Image Editing of Denoising Diffusion Models [50.005612464340246]
We propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. Our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations.
arXiv Detail & Related papers (2024-12-17T13:46:12Z)
- TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
- Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis [15.76266032768078]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. We then propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- Grounded Text-to-Image Synthesis with Attention Refocusing [16.9170825951175]
We reveal the potential causes in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
arXiv Detail & Related papers (2023-06-08T17:59:59Z)
- Cross-domain Compositing with Pretrained Diffusion Models [34.98199766006208]
We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene.
Our method produces higher quality and realistic results without requiring any annotations or training.
arXiv Detail & Related papers (2023-02-20T18:54:04Z)
- BDA-SketRet: Bi-Level Domain Adaptation for Zero-Shot SBIR [52.78253400327191]
BDA-SketRet is a novel framework performing a bi-level domain adaptation for aligning the spatial and semantic features of the visual data pairs.
Experimental results on the extended Sketchy, TU-Berlin, and QuickDraw exhibit sharp improvements over the literature.
arXiv Detail & Related papers (2022-01-17T18:45:55Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Deep convolutional embedding for digitized painting clustering [14.228308494671703]
We propose a deep convolutional embedding model for digitized painting clustering.
The model is capable of outperforming other state-of-the-art deep clustering approaches to the same problem.
The proposed method can be useful for several art-related tasks, in particular visual link retrieval and historical knowledge discovery in painting datasets.
arXiv Detail & Related papers (2020-03-19T06:49:38Z)
- Focus on Semantic Consistency for Cross-domain Crowd Understanding [34.560447389853614]
Some domain adaptation algorithms try to address this by training models with synthetic data.
We found that a mass of estimation errors in the background areas impedes the performance of existing methods.
In this paper, we propose a domain adaptation method to eliminate these errors.
arXiv Detail & Related papers (2020-02-20T08:51:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.