LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
- URL: http://arxiv.org/abs/2507.22627v1
- Date: Wed, 30 Jul 2025 12:48:29 GMT
- Title: LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
- Authors: Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, Marco Cristani
- Abstract summary: We present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image.
- Score: 12.33060414705514
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
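The abstract describes a two-phase design: localized sketch + text pairs and a global description are first encoded into a shared latent space, and the resulting embeddings then guide the diffusion model's denoising steps through attention. The following is a minimal, illustrative sketch of that second idea, assuming a transformer-style latent denoiser that cross-attends to the concatenated global and pair tokens; all module names, shapes, and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one denoiser block in which noisy
# latent image tokens attend to [global description | localized sketch+text pair]
# conditioning tokens. Dimensions and module names are hypothetical.
import torch
import torch.nn as nn


class PairGuidedDenoiserBlock(nn.Module):
    """Toy denoiser block: image tokens attend to concatenated conditioning tokens."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, img_tokens, cond_tokens):
        # Self-attention over the noisy latent image tokens.
        x = img_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Attention-based guidance: each image token reads from the global
        # description token and the localized sketch+text pair tokens.
        x = x + self.cross_attn(self.norm2(x), cond_tokens, cond_tokens)[0]
        return x + self.ff(self.norm3(x))


if __name__ == "__main__":
    # Fabricated example shapes: batch of 2, 64 latent tokens per image,
    # 1 global description token and 3 localized sketch+text pair tokens.
    dim = 256
    global_txt = torch.randn(2, 1, dim)    # global description embedding
    pair_tokens = torch.randn(2, 3, dim)   # one token per (sketch, text) pair
    cond = torch.cat([global_txt, pair_tokens], dim=1)
    img_tokens = torch.randn(2, 64, dim)   # noisy latent image tokens
    block = PairGuidedDenoiserBlock(dim)
    print(block(img_tokens, cond).shape)   # torch.Size([2, 64, 256])
```

In practice such a block would be applied at every denoising step, which is where a step-based merging strategy like the one described in the abstract could decide how the local and global conditioning tokens are combined; the sketch above only shows a single attention pass.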
Related papers
- StyleBlend: Enhancing Style-Specific Content Creation in Text-to-Image Diffusion Models [10.685779311280266]
StyleBlend is a method designed to learn and apply style representations from a limited set of reference images. The approach decomposes style into two components, composition and texture, each learned through different strategies.
arXiv Detail & Related papers (2025-02-13T08:26:54Z)
- ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing [25.610375901522886]
ArtCrafter is a novel framework for text-to-image style transfer. It introduces an attention-based style extraction module and a novel text-image aligning augmentation component.
arXiv Detail & Related papers (2025-01-03T19:17:27Z)
- ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. ArtWeaver is a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
- StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding [7.291687946822539]
Single-StyleForge is a novel approach for personalized text-to-image synthesis across diverse artistic styles. Multi-StyleForge further enhances image quality and text alignment by binding multiple tokens to partial style attributes.
arXiv Detail & Related papers (2024-04-08T07:43:23Z)
- HAIFIT: Human-to-AI Fashion Image Translation [6.034505799418777]
HAIFIT is a novel approach that transforms sketches into high-fidelity, lifelike clothing images. The method excels in preserving the distinctive style and intricate details essential for fashion design applications.
arXiv Detail & Related papers (2024-03-13T16:06:07Z)
- CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing [21.12815542848095]
Personalization techniques for large text-to-image (T2I) models allow users to incorporate new concepts from reference images. Existing methods primarily rely on textual descriptions, leading to limited control over customized images. This work identifies sketches as an intuitive and versatile representation that can facilitate such control.
arXiv Detail & Related papers (2024-02-27T15:52:59Z)
- Style Aligned Image Generation via Shared Attention [61.121465570763085]
StyleAligned is a technique designed to establish style alignment among a series of generated images. By employing minimal 'attention sharing' during the diffusion process, the method maintains style consistency across images within T2I models. Evaluation across diverse styles and text prompts demonstrates high quality and fidelity.
arXiv Detail & Related papers (2023-12-04T18:55:35Z)
- DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization [66.42741426640633]
DiffStyler is a dual diffusion processing architecture that controls the balance between the content and style of diffused results. It proposes a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure of the content image.
arXiv Detail & Related papers (2022-11-19T12:30:44Z)
- AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation [61.77946020543875]
This work proposes a framework for translating raw descriptions with complex semantics into semantically corresponding images. The framework consists of two components: a prompt-based projection module from text embeddings to image embeddings, and an adapted image generation module built on StyleGAN. Benefiting from pre-trained models, the method can handle complex descriptions and does not require external paired data for training.
arXiv Detail & Related papers (2022-09-07T13:53:54Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, it designs a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions. A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN; visual-linguistic similarity learning performs text-image matching by mapping the image and text into a common embedding space; and instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Sketch-to-Art: Synthesizing Stylized Art Images From Sketches [23.75420342238983]
This work proposes a new approach for synthesizing fully detailed, art-stylized images from sketches. Given a sketch with no semantic tagging and a reference image of a specific style, the model can synthesize meaningful details with colors and textures.
arXiv Detail & Related papers (2020-02-26T19:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.