HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
- URL: http://arxiv.org/abs/2505.23186v1
- Date: Thu, 29 May 2025 07:23:40 GMT
- Title: HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
- Authors: Junyi Guo, Jingxuan Zhang, Fangyu Wu, Huanda Lu, Qiufeng Wang, Wenmian Yang, Eng Gee Lim, Dongming Lu
- Abstract summary: HiGarment is a novel framework that enhances fabric representation across textual and visual modalities. We propose Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. We collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation.
- Score: 20.177936034245572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
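The harmonized cross-attention described in the abstract can be illustrated as a gated blend of sketch-conditioned and text-conditioned attention outputs. This is a minimal numpy sketch under stated assumptions: the gate is a scalar `bias` chosen by hand here, whereas the paper's mechanism presumably balances the two modalities dynamically; all shapes and names are hypothetical, not the authors' implementation.

```python
import numpy as np

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def harmonized_cross_attention(latent, sketch_tokens, text_tokens, bias):
    """Blend sketch- and text-conditioned cross-attention.

    bias in [0, 1]: 1.0 -> sketch-aligned (image-biased) output,
    0.0 -> text-guided (text-biased) output. A toy stand-in for the
    paper's harmonized balancing; the real gate is presumably learned.
    """
    sketch_out = attention(latent, sketch_tokens, sketch_tokens)
    text_out = attention(latent, text_tokens, text_tokens)
    return bias * sketch_out + (1.0 - bias) * text_out
```

Setting `bias` to 1.0 or 0.0 recovers purely sketch-aligned or purely text-guided synthesis, matching the image-biased/text-biased controllability the abstract describes.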
Related papers
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing [12.33060414705514]
We present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image.
arXiv Detail & Related papers (2025-07-30T12:48:29Z)
- IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design [44.46962562795136]
IMAGGarment-1 is a fine-grained garment generation framework. It enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement.
arXiv Detail & Related papers (2025-04-17T17:59:47Z)
- Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting [39.50293003775675]
We propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM). The proposed framework aims at customizing a fashion model wearing a given garment via text prompts and facial images.
arXiv Detail & Related papers (2025-03-03T08:30:37Z)
- AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models [7.534556848810697]
We propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and personalized text prompts. AnyDressing comprises two primary networks, GarmentsNet and DressingNet, dedicated to extracting detailed clothing features. We introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments.
arXiv Detail & Related papers (2024-12-05T13:16:47Z)
- Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks.
We design a new diffusion model, GarDiff, which triggers a garment-focused diffusion process.
Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z)
- Multi-Garment Customized Model Generation [3.1679243514285194]
Multi-Garment Customized Model Generation is a unified framework based on Latent Diffusion Models (LDMs).
Our framework supports the conditional generation of multiple garments through decoupled multi-garment feature fusion.
The proposed garment encoder is a plug-and-play module that can be combined with other extension modules.
arXiv Detail & Related papers (2024-08-09T17:57:33Z)
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
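The meta-prompt idea above (learnable query embeddings that pull task-relevant features out of frozen diffusion activations) can be illustrated with a minimal attention read-out. This is a hedged sketch: the flattened feature shapes, the single-head read-out, and the fixed (untrained) prompts are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def meta_prompt_readout(feature_map, meta_prompts):
    """Read out task-adapted features with learnable prompt queries.

    feature_map: (HW, C) flattened activations from a frozen diffusion backbone.
    meta_prompts: (P, C) learnable embeddings acting as attention queries.
    Returns (P, C): one adapted feature vector per prompt, each a convex
    combination of the backbone features.
    """
    attn = softmax(meta_prompts @ feature_map.T / np.sqrt(feature_map.shape[-1]))
    return attn @ feature_map
```

In training, only `meta_prompts` (and a lightweight task head) would receive gradients, which is what makes the scheme cheap to adapt per perception task.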
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design [66.68194916359309]
Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain.
MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information.
ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image.
arXiv Detail & Related papers (2022-08-11T03:44:02Z)
- Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
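The flow-based warping that DAFlow's attention scheme feeds into can be sketched with a plain dense-flow bilinear warp. This toy function only shows the warping step; the deformable-attention multi-flow estimation itself is omitted, and the per-pixel flow convention (offsets in pixels, `(dy, dx)` order) is an assumption for illustration.

```python
import numpy as np

def warp_bilinear(image, flow):
    """Warp an image with a dense flow field via bilinear sampling.

    image: (H, W, C) float array; flow: (H, W, 2) per-pixel source
    offsets (dy, dx). Each output pixel samples the input at
    (y + dy, x + dx), clipped to the image bounds.
    """
    H, W, C = image.shape
    out = np.zeros_like(image, dtype=float)
    for y in range(H):
        for x in range(W):
            sy = float(np.clip(y + flow[y, x, 0], 0, H - 1))
            sx = float(np.clip(x + flow[y, x, 1], 0, W - 1))
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
            wy, wx = sy - y0, sx - x0
            out[y, x] = ((1 - wy) * (1 - wx) * image[y0, x0]
                         + (1 - wy) * wx * image[y0, x1]
                         + wy * (1 - wx) * image[y1, x0]
                         + wy * wx * image[y1, x1])
    return out
```

A zero flow field leaves the image unchanged; in try-on, the estimated flows deform the in-shop garment onto the reference person's pose.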
arXiv Detail & Related papers (2022-07-19T10:01:31Z)
- Region-adaptive Texture Enhancement for Detailed Person Image Synthesis [86.69934638569815]
RATE-Net is a novel framework for synthesizing person images with sharp texture details.
The proposed framework leverages an additional texture enhancing module to extract appearance information from the source image.
Experiments conducted on the DeepFashion benchmark dataset have demonstrated the superiority of our framework compared with existing networks.
arXiv Detail & Related papers (2020-05-26T02:33:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.