Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation
- URL: http://arxiv.org/abs/2602.18309v1
- Date: Fri, 20 Feb 2026 16:07:31 GMT
- Title: Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation
- Authors: Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani
- Abstract summary: We present LOcalized Text and Sketch with multi-level guidance (LOTS). LOTS combines global sketch guidance with multiple localized sketch-text pairs. We develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image.
- Score: 14.962452069195544
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining the textual and visual modalities requires adherence to the sketch's visual structure while leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvements over the state of the art. The dataset, platform, and code are publicly available.
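The abstract only names the two stages; the snippet below is a minimal, illustrative PyTorch-style sketch of how they could fit together, not the authors' released implementation. All module names (PairEncoder, PairCrossAttention), shapes, and hyperparameters are assumptions made here for illustration: each localized sketch-text pair is encoded independently into a shared token space, and the concatenated local and global tokens then steer the image latents through residual cross-attention at a denoising step.

```python
# Minimal, illustrative sketch (not the authors' code): encode each localized
# sketch-text pair independently, keep a global sketch/text token stream, and
# merge everything into a denoising step through cross-attention.
import torch
import torch.nn as nn


class PairEncoder(nn.Module):
    """Encodes one localized (sketch crop, text) pair into shared latent tokens."""

    def __init__(self, dim: int = 256, text_vocab: int = 10000):
        super().__init__()
        self.sketch_proj = nn.Sequential(nn.Conv2d(1, dim, 4, stride=4), nn.GELU())
        self.text_embed = nn.Embedding(text_vocab, dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )

    def forward(self, sketch: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # sketch: (B, 1, H, W), text_ids: (B, T)
        s = self.sketch_proj(sketch).flatten(2).transpose(1, 2)   # (B, N_s, dim)
        t = self.text_embed(text_ids)                             # (B, T, dim)
        return self.fuse(torch.cat([s, t], dim=1))                # (B, N_s + T, dim)


class PairCrossAttention(nn.Module):
    """Injects concatenated local + global condition tokens into image latents."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(self.norm(latents), cond, cond)
        return latents + out  # residual guidance, reused at every denoising step


if __name__ == "__main__":
    B, dim = 2, 256
    encoder, guidance = PairEncoder(dim), PairCrossAttention(dim)
    # Three localized sketch-text pairs per image, plus one global sketch/description.
    pairs = [
        encoder(torch.randn(B, 1, 64, 64), torch.randint(0, 10000, (B, 8)))
        for _ in range(3)
    ]
    global_tokens = encoder(torch.randn(B, 1, 128, 128), torch.randint(0, 10000, (B, 16)))
    cond = torch.cat(pairs + [global_tokens], dim=1)   # shared latent space
    latents = torch.randn(B, 32 * 32, dim)             # flattened image latents
    print(guidance(latents, cond).shape)                # torch.Size([2, 1024, 256])
```

Keeping every pair in the same latent space is what lets a single cross-attention layer attend jointly over local and global conditions; the actual LOTS model presumably interleaves such guidance with a pretrained diffusion U-Net rather than the toy latents used here.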
Related papers
- VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation [73.23035143627598]
Most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models. Our method generates high-quality sketches that closely follow text-specified orderings while exhibiting rich visual detail.
arXiv Detail & Related papers (2026-02-17T18:55:03Z)
- SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing [13.733328072282049]
We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model.
arXiv Detail & Related papers (2025-12-16T06:50:44Z)
- Text to Sketch Generation with Multi-Styles [17.309370958875785]
We propose a training-free framework based on diffusion models that enables explicit style guidance. We incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. Our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control.
arXiv Detail & Related papers (2025-11-06T07:13:56Z)
- Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration [26.920087528015205]
This paper presents a real-time generative drawing system that interprets and integrates both formal intent and contextual intent. The system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases.
arXiv Detail & Related papers (2025-08-12T01:34:23Z)
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing [13.90016469666642]
We present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image.
arXiv Detail & Related papers (2025-07-30T12:48:29Z)
- Recovering Partially Corrupted Objects via Sketch-Guided Bidirectional Feature Interaction [16.03488741913531]
Text-guided diffusion models provide high-level semantic guidance through text prompts. They often lack precise pixel-level spatial control in partially corrupted objects. We propose a sketch-guided bidirectional feature interaction framework built upon a pretrained Stable Diffusion model.
arXiv Detail & Related papers (2025-03-10T08:34:31Z)
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- Bridging the Gap: Sketch-Aware Interpolation Network for High-Quality Animation Sketch Inbetweening [58.09847349781176]
We propose a novel deep learning method, the Sketch-Aware Interpolation Network (SAIN).
This approach incorporates multi-level guidance that formulates region-level correspondence, stroke-level correspondence, and pixel-level dynamics.
A multi-stream U-Transformer is then devised to characterize sketch inbetweening patterns using these multi-level guides through the integration of self- and cross-attention mechanisms.
arXiv Detail & Related papers (2023-08-25T09:51:03Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to models specifically trained with layout conditions (a minimal sketch of this attention-modulation idea is given at the end of this page).
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision level.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges that come with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning matches text and images by mapping them into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
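Several entries above (DenseDiffusion, LoCo) describe training-free, attention-level layout control for pretrained text-to-image diffusion models. The snippet below is a minimal sketch of that general idea, not code from either paper: cross-attention logits between image positions and text tokens are biased by per-token layout masks before the softmax, so each phrase is encouraged to appear in its intended region without any fine-tuning. The mask layout and the strength value are invented for the example.

```python
# Minimal training-free attention-modulation sketch (in the spirit of
# DenseDiffusion / LoCo, not their released code): boost cross-attention logits
# inside each token's layout region and suppress them outside, then renormalize.
import torch


def modulate_cross_attention(
    scores: torch.Tensor,        # (B, heads, N_pixels, N_tokens) raw attention logits
    token_masks: torch.Tensor,   # (N_tokens, N_pixels), 1 where a token's phrase should appear
    strength: float = 2.0,
) -> torch.Tensor:
    """Bias logits by the layout masks and return normalized attention weights."""
    bias = strength * (token_masks.transpose(0, 1) - 0.5)  # (N_pixels, N_tokens), values in [-s/2, s/2]
    return (scores + bias).softmax(dim=-1)


if __name__ == "__main__":
    B, heads, pixels, tokens = 1, 8, 64 * 64, 12
    scores = torch.randn(B, heads, pixels, tokens)
    # Hypothetical layout: the first 6 text tokens belong to the upper half of
    # the image, the remaining tokens to the lower half.
    masks = torch.zeros(tokens, pixels)
    masks[:6, : pixels // 2] = 1.0
    masks[6:, pixels // 2 :] = 1.0
    attn = modulate_cross_attention(scores, masks)
    print(attn.shape, float(attn.sum(dim=-1).mean()))  # (1, 8, 4096, 12), rows sum to ~1.0
```

In practice such biasing is applied inside selected cross-attention layers of the denoising U-Net and often annealed over the sampling steps; here it is shown on standalone tensors only.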