SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches
- URL: http://arxiv.org/abs/2502.07556v1
- Date: Tue, 11 Feb 2025 13:48:11 GMT
- Title: SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches
- Authors: Haichuan Lin, Yilin Ye, Jiazhi Xia, Wei Zeng
- Abstract summary: SketchFlex is an interactive system designed to improve the flexibility of spatially conditioned image generation.
It infers rational prompt descriptions within a semantic space enriched by crowd-sourced object attributes and relationships.
It also refines users' rough sketches into Canny-based shape anchors, ensuring generation quality and alignment with user intentions.
- Score: 4.55322003438174
- License:
- Abstract: Text-to-image models can generate visually appealing images from text descriptions. Efforts have been devoted to improving model controls with prompt tuning and spatial conditioning. However, our formative study highlights the challenges for non-expert users in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or Canny references) to generate semantically cohesive images, especially when multiple objects are involved. In response, we introduce SketchFlex, an interactive system designed to improve the flexibility of spatially conditioned image generation using rough region sketches. The system automatically infers rational prompt descriptions within a semantic space enriched by crowd-sourced object attributes and relationships. Additionally, SketchFlex refines users' rough sketches into Canny-based shape anchors, ensuring generation quality and alignment with user intentions. Experimental results demonstrate that SketchFlex achieves more cohesive image generations than end-to-end models while significantly reducing cognitive load and better matching user intentions compared to a region-based generation baseline.
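To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the kind of Canny-based spatial conditioning the abstract refers to, using Hugging Face diffusers with a ControlNet; the model IDs, file names, and prompt are illustrative assumptions.

```python
# Minimal illustration (not SketchFlex itself): convert a rough region sketch
# into a Canny edge map and use it as a shape anchor for spatially conditioned
# generation with a ControlNet-augmented Stable Diffusion pipeline.
# Model IDs, file names, and the prompt are assumptions for illustration.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Derive a Canny-based shape anchor from the rough sketch.
sketch = cv2.imread("rough_region_sketch.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(sketch, 100, 200)
anchor = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# In SketchFlex the prompt would be inferred and enriched automatically;
# here it is written by hand.
prompt = "a ginger cat sitting on a wooden table, soft morning light"
image = pipe(prompt, image=anchor, num_inference_steps=30).images[0]
image.save("generation.png")
```

A full region-based workflow would additionally restrict each region's prompt to its sketched area; the snippet above only shows the single-condition case.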
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on an additional reference image that provides visual guidance for the particular subjects to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins demonstrate superior results to existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework that effectively combines sketches and text using pre-trained CLIP models (see the illustrative code sketch after this list).
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language.
Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z) - Reference-based Image Composition with Sketch via Structure-aware Diffusion Model [38.1193912666578]
We introduce a multi-input-conditioned image composition model that incorporates a sketch as a novel modality, alongside a reference image.
Thanks to the edge-level controllability using sketches, our method enables a user to edit or complete an image sub-part.
Our framework fine-tunes a pre-trained diffusion model to complete missing regions using the reference image while maintaining sketch guidance.
arXiv Detail & Related papers (2023-03-31T06:12:58Z) - Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model [31.652827838300915]
We propose a unified diffusion-based framework supporting three-dimensional control over image synthesis from sketches and strokes.
Our framework achieves state-of-the-art performance while providing flexibility in generating customized images with control over shape, color, and realism.
Our method unleashes applications such as editing on real images, generation with partial sketches and strokes, and multi-domain multi-modal synthesis.
arXiv Detail & Related papers (2022-08-26T13:59:26Z) - Deep Generation of Face Images from Sketches [36.146494762987146]
Deep image-to-image translation techniques allow fast generation of face images from freehand sketches.
Existing solutions tend to overfit to sketches, thus requiring professional sketches or even edge maps as input.
We propose to implicitly model the shape space of plausible face images and synthesize a face image in this space to approximate an input sketch.
Our method essentially uses input sketches as soft constraints and is thus able to produce high-quality face images even from rough and/or incomplete sketches.
arXiv Detail & Related papers (2020-06-01T16:20:23Z) - SketchyCOCO: Image Generation from Freehand Scene Sketches [71.85577739612579]
We introduce the first method for automatic image generation from scene-level freehand sketches.
The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN.
We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution.
arXiv Detail & Related papers (2020-03-05T14:54:10Z) - Deep Plastic Surgery: Robust and Controllable Image Editing with Human-Drawn Sketches [133.01690754567252]
Sketch-based image editing aims to synthesize and modify photos based on the structural information provided by the human-drawn sketches.
Deep Plastic Surgery is a novel, robust and controllable image editing framework that allows users to interactively edit images using hand-drawn sketch inputs.
arXiv Detail & Related papers (2020-01-09T08:57:50Z)
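As a generic illustration of the sketch-and-text combination with pre-trained CLIP mentioned in the fine-grained retrieval entry above (not that paper's actual method), the following composes a CLIP sketch embedding with a CLIP text embedding and ranks gallery photos by cosine similarity; the checkpoint name, file names, and the simple embedding averaging are assumptions.

```python
# Generic sketch of composing a sketch image and text with CLIP for retrieval.
# NOT the cited paper's method; checkpoint, file names, and the averaging
# scheme are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

sketch = Image.open("query_sketch.png").convert("RGB")
text = "a striped mug with a broken handle"

with torch.no_grad():
    sketch_emb = model.get_image_features(
        **processor(images=sketch, return_tensors="pt")
    )
    text_emb = model.get_text_features(
        **processor(text=[text], return_tensors="pt", padding=True)
    )

# Compose the query by averaging the L2-normalized modalities.
query = torch.nn.functional.normalize(sketch_emb, dim=-1) + \
        torch.nn.functional.normalize(text_emb, dim=-1)
query = torch.nn.functional.normalize(query, dim=-1)

# Rank gallery photos by cosine similarity to the composed query.
gallery = [Image.open(p).convert("RGB") for p in ["photo_1.jpg", "photo_2.jpg"]]
with torch.no_grad():
    gallery_emb = model.get_image_features(
        **processor(images=gallery, return_tensors="pt")
    )
gallery_emb = torch.nn.functional.normalize(gallery_emb, dim=-1)
scores = (query @ gallery_emb.T).squeeze(0)
print(scores.argsort(descending=True))
```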
This list is automatically generated from the titles and abstracts of the papers on this site.