L-CAD: Language-based Colorization with Any-level Descriptions using
Diffusion Priors
- URL: http://arxiv.org/abs/2305.15217v3
- Date: Mon, 23 Oct 2023 04:56:09 GMT
- Title: L-CAD: Language-based Colorization with Any-level Descriptions using
Diffusion Priors
- Authors: Zheng Chang, Shuchen Weng, Peixuan Zhang, Yu Li, Si Li, Boxin Shi
- Abstract summary: We propose a unified model to perform language-based colorization with any-level descriptions.
We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors.
With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios.
- Score: 62.80068955192816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-based colorization produces plausible and visually pleasing colors
under the guidance of user-friendly natural language descriptions. Previous
methods implicitly assume that users provide comprehensive color descriptions
for most of the objects in the image, which leads to suboptimal performance. In
this paper, we propose a unified model to perform language-based colorization
with any-level descriptions. We leverage the pretrained cross-modality
generative model for its robust language understanding and rich color priors to
handle the inherent ambiguity of any-level descriptions. We further design
modules to align with input conditions to preserve local spatial structures and
prevent the ghosting effect. With the proposed novel sampling strategy, our
model achieves instance-aware colorization in diverse and complex scenarios.
Extensive experimental results demonstrate the advantages of our model in
effectively handling any-level descriptions and in outperforming both language-based and
automatic colorization methods. The code and pretrained models are available
at: https://github.com/changzheng123/L-CAD.
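For intuition only, the following is a minimal sketch, in PyTorch with a stub denoiser and invented names, of the general recipe the abstract outlines: sampling from a text-conditioned diffusion prior while anchoring each step to the grayscale input so local spatial structure is preserved. It is not the released L-CAD implementation; see the repository above for the actual code.
```python
# Hypothetical sketch (not the authors' code): luminance-anchored sampling for
# text-conditioned colorization. The denoiser is a stub standing in for a
# pretrained cross-modality generative model (e.g., a text-to-image UNet).
import torch
import torch.nn as nn

class StubDenoiser(nn.Module):
    """Placeholder epsilon-predictor; a real system would use a pretrained UNet."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)
    def forward(self, x_t, gray, t, text_emb):
        # Condition on the grayscale input by channel concatenation;
        # text and timestep conditioning are omitted in this stub.
        return self.net(torch.cat([x_t, gray], dim=1))

def colorize(gray, text_emb, denoiser, steps=50):
    """DDPM-style ancestral sampling; after each step the per-pixel luminance
    of the estimate is replaced by the grayscale input, one simple way to
    realize the structure-preservation idea described in the abstract."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(gray.shape[0], 3, *gray.shape[2:])
    for t in reversed(range(steps)):
        eps = denoiser(x, gray, t, text_emb)
        a_t, ab_t = alphas[t], alpha_bars[t]
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
        # Anchor luminance: overwrite the channel mean with the gray input.
        luminance = x.mean(dim=1, keepdim=True)
        x = x - luminance + gray
    return x.clamp(-1, 1)

gray = torch.rand(1, 1, 64, 64) * 2 - 1          # grayscale input in [-1, 1]
out = colorize(gray, text_emb=None, denoiser=StubDenoiser())
print(out.shape)  # torch.Size([1, 3, 64, 64])
```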
Related papers
- L-C4: Language-Based Video Colorization for Creative and Consistent Color [59.069498113050436]
We present Language-based video colorization for Creative and Consistent Colors (L-C4).
Our model is built upon a pre-trained cross-modality generative model.
We propose temporally deformable attention to prevent flickering or color shifts, and cross-clip fusion to maintain long-term color consistency.
arXiv Detail & Related papers (2024-10-07T12:16:21Z)
- Control Color: Multimodal Diffusion-based Interactive Image Colorization [81.68817300796644]
Control Color (Ctrl Color) is a multi-modal colorization method that leverages the pre-trained Stable Diffusion (SD) model.
We present an effective way to encode user strokes to enable precise local color manipulation.
We also introduce a novel module based on self-attention and a content-guided deformable autoencoder to address the long-standing issues of color overflow and inaccurate coloring.
arXiv Detail & Related papers (2024-02-16T17:51:13Z)
- Language-based Photo Color Adjustment for Graphic Designs [38.43984897069872]
We introduce an interactive language-based approach for photo recoloring.
Our model can predict the source colors and the target regions, and then recolor the target regions with the source colors according to the given language instruction.
arXiv Detail & Related papers (2023-08-06T08:53:49Z)
- DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models [12.897939032560537]
We propose a new method called DiffColor to recover vivid colors conditioned on a text prompt.
We first fine-tune a pre-trained text-to-image model to generate colorized images using a CLIP-based contrastive loss.
We then obtain an optimized text embedding that aligns the colorized image with the text prompt, and a fine-tuned diffusion model that enables high-quality image reconstruction.
Our method can produce vivid and diverse colors with a few iterations, and keep the structure and background intact while having colors well-aligned with the target language guidance.
arXiv Detail & Related papers (2023-08-03T09:38:35Z)
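As a rough illustration of the embedding-optimization stage described in the DiffColor summary above, the hedged sketch below optimizes a free text embedding so that a stub conditional denoiser reconstructs a given colorized image. All modules, names, and the noise schedule are placeholders of my own, not DiffColor's implementation.
```python
# Assumption-laden sketch: optimize a text embedding so a (stub) conditional
# denoiser reconstructs a target colorized image via a noise-prediction loss.
import torch
import torch.nn as nn

class StubCondDenoiser(nn.Module):
    """Toy noise predictor; a real pipeline would use a pretrained text-to-image UNet."""
    def __init__(self, emb_dim=16):
        super().__init__()
        self.img = nn.Conv2d(3, 3, 3, padding=1)
        self.txt = nn.Linear(emb_dim, 3)
    def forward(self, x_t, t, text_emb):
        # Predict the noise in x_t, conditioned on the (optimizable) text embedding.
        return self.img(x_t) + self.txt(text_emb)[:, :, None, None]

denoiser = StubCondDenoiser()
target = torch.rand(1, 3, 32, 32) * 2 - 1            # the colorized image to reconstruct
text_emb = nn.Parameter(torch.randn(1, 16) * 0.02)   # in practice: initialized from the prompt's encoder features
opt = torch.optim.Adam([text_emb], lr=1e-2)
alpha_bar = torch.linspace(0.99, 0.05, 100)          # toy noise schedule

for step in range(200):
    t = torch.randint(0, 100, (1,))
    noise = torch.randn_like(target)
    x_t = torch.sqrt(alpha_bar[t]) * target + torch.sqrt(1.0 - alpha_bar[t]) * noise
    loss = nn.functional.mse_loss(denoiser(x_t, t, text_emb), noise)
    opt.zero_grad(); loss.backward(); opt.step()

print("final reconstruction loss:", float(loss))
```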
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
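The SLR-AVD recipe summarized above lends itself to a short sketch. In the snippet below, the class descriptions and the VLM similarity features are replaced by hard-coded strings and random numbers so it runs stand-alone; only the sparse-selection step (L1-regularized logistic regression via scikit-learn) reflects the described method.
```python
# Hedged sketch of the described pipeline (not the authors' code):
# per-class descriptors -> VLM similarity features -> sparse logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 1 (assumed): an LLM produces short visual descriptions per class, e.g.
descriptors = {
    "zebra": ["black and white stripes", "four legs", "mane along the neck"],
    "tiger": ["orange fur with black stripes", "whiskers", "large paws"],
}
n_feat = sum(len(v) for v in descriptors.values())  # one feature per descriptor

# Step 2 (stand-in): each feature would be the VLM cosine similarity between an
# image and one descriptor; random numbers are used so the sketch runs offline.
X_train = rng.normal(size=(200, n_feat))
y_train = rng.integers(0, 2, size=200)

# Step 3: L1-regularized logistic regression keeps only the most relevant
# descriptor-similarity features (sparse feature selection).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_train, y_train)
selected = np.flatnonzero(np.abs(clf.coef_).max(axis=0) > 1e-6)
print("selected feature indices:", selected)
```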
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957]
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles.
Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation.
To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision.
arXiv Detail & Related papers (2023-03-16T12:44:44Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
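One way to picture an output layer whose size does not depend on the word vocabulary, as the last entry describes, is to compose each word's output embedding from character embeddings on the fly. The toy below is my own assumption-laden illustration of that idea, not the paper's architecture.
```python
# Toy compositional output layer: logits over arbitrary candidate words are
# computed from character embeddings, so there is no per-word parameter table.
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    def __init__(self, hidden=32, n_char=128, char_dim=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_char, char_dim)  # fixed-size character table
        self.proj = nn.Linear(char_dim, hidden)
    def word_embedding(self, word):
        # Compose an output embedding for any word from its characters.
        ids = torch.tensor([min(ord(c), 127) for c in word])
        return self.proj(self.char_emb(ids).mean(dim=0))
    def forward(self, hidden_state, candidate_words):
        # Score an arbitrary candidate set against the model's hidden state.
        table = torch.stack([self.word_embedding(w) for w in candidate_words])
        return hidden_state @ table.T

out_layer = CompositionalOutput()
h = torch.randn(1, 32)
print(out_layer(h, ["color", "colour", "colorization"]).shape)  # torch.Size([1, 3])
```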
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.