Related papers: RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

URL: http://arxiv.org/abs/2403.00483v1
Date: Fri, 1 Mar 2024 12:12:09 GMT
Title: RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Authors: Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, Yongdong Zhang
Abstract summary: Text-to-image customization aims to synthesize text-driven images for the given subjects. Existing works follow the pseudo-word paradigm, i.e., represent the given subjects as pseudo-words and then compose them with the given text. We present RealCustom that, for the first time, disentangles similarity from controllability by precisely limiting subject influence to relevant parts only.
Score: 57.86083349873154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image customization, which aims to synthesize text-driven images for the given subjects, has recently revolutionized content creation. Existing works follow the pseudo-word paradigm, i.e., represent the given subjects as pseudo-words and then compose them with the given text. However, the inherent entangled influence scope of pseudo-words with the given text results in a dual-optimum paradox, i.e., the similarity of the given subjects and the controllability of the given text could not be optimal simultaneously. We present RealCustom that, for the first time, disentangles similarity from controllability by precisely limiting subject influence to relevant parts only, achieved by gradually narrowing real text word from its general connotation to the specific subject and using its cross-attention to distinguish relevance. Specifically, RealCustom introduces a novel "train-inference" decoupled framework: (1) during training, RealCustom learns general alignment between visual conditions to original textual conditions by a novel adaptive scoring module to adaptively modulate influence quantity; (2) during inference, a novel adaptive mask guidance strategy is proposed to iteratively update the influence scope and influence quantity of the given subjects to gradually narrow the generation of the real text word. Comprehensive experiments demonstrate the superior real-time customization ability of RealCustom in the open domain, achieving both unprecedented similarity of the given subjects and controllability of the given text for the first time. The project page is https://corleone-huang.github.io/realcustom/.

Related papers

RealCustom++: Representing Images as Real-Word for Real-Time Customization [80.04828124070418]
Text-to-image customization aims to synthesize new images that align with both text semantics and subject appearance. Existing works follow the pseudo-word paradigm, which involves representing given subjects as pseudo-words. We propose a novel real-words paradigm termed RealCustom++ that instead represents subjects as non-conflict real words.
arXiv Detail & Related papers (2024-08-19T07:15:44Z)
Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample. We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens. We illustrate whose benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
arXiv Detail & Related papers (2024-02-21T03:01:17Z)
FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework [19.564048493848272]
Scene Text Editing (STE) is a challenging research problem, that primarily aims towards modifying existing texts in an image. Existing style-transfer-based approaches have shown sub-par editing performance due to complex image backgrounds, diverse font attributes, and varying word lengths within the text. We propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations.
arXiv Detail & Related papers (2023-08-05T15:54:06Z)
Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
arXiv Detail & Related papers (2022-08-02T17:55:41Z)
Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems. We present an iterative in-place editing approach for text revision, which requires no parallel data. It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
discrete nature of natural language makes disentangling of textual representations more challenging. Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.