GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections
- URL: http://arxiv.org/abs/2408.12352v2
- Date: Fri, 23 Aug 2024 05:01:35 GMT
- Authors: Shiyue Zhang, Zheng Chong, Xujie Zhang, Hanhui Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang
- Abstract summary: GarmentAligner is a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections.
To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline.
To exploit component relationships within the garment images, we construct retrieval subsets for each garment.
- Score: 63.82168065819053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General text-to-image models bring revolutionary innovation to the fields of arts, design, and media. However, when applied to garment generation, even the state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. Addressing this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline to obtain spatial and quantitative information of garment components from corresponding images and captions. Subsequently, to exploit component relationships within the garment images, we construct retrieval subsets for each garment by retrieval augmentation based on component-level similarity ranking and conduct contrastive learning to enhance the model perception of components from positive and negative samples. To further enhance the alignment of components across semantic, spatial, and quantitative granularities, we propose the utilization of multi-level correction losses that leverage detailed component information. The experimental findings demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment when compared to existing competitors.
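The retrieval-augmentation step described in the abstract — ranking garments by component-level similarity and forming positive/negative subsets for contrastive learning — can be illustrated with a minimal sketch. All names, the component-count representation, and the cosine similarity measure here are illustrative assumptions, not the paper's actual implementation:

```python
import math

def component_similarity(a, b):
    """Cosine similarity between two component-count vectors
    (hypothetical stand-in for the paper's component-level similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def build_retrieval_subsets(query, gallery, k=2):
    """Rank gallery garments by similarity to the query garment's
    component vector; return the top-k as the positive subset and
    the bottom-k as the negative subset."""
    ranked = sorted(
        gallery,
        key=lambda g: component_similarity(query, g["components"]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]

# Each garment is summarized by counts of its components,
# e.g. [sleeves, collars, pockets] (an assumed encoding).
query = [2, 1, 0]
gallery = [
    {"id": "a", "components": [2, 1, 0]},
    {"id": "b", "components": [2, 0, 0]},
    {"id": "c", "components": [0, 0, 3]},
    {"id": "d", "components": [0, 1, 2]},
]
positives, negatives = build_retrieval_subsets(query, gallery, k=1)
```

In the full method, such positive and negative subsets would feed a contrastive objective so the model learns to distinguish garments whose components match the caption from those that do not; the multi-level correction losses then operate on the extracted semantic, spatial, and quantitative component information.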
Related papers
- Unsupervised Part Discovery via Dual Representation Alignment [31.100169532078095]
Object parts serve as crucial intermediate representations in various downstream tasks.
Previous research has established that Vision Transformer can learn instance-level attention without labels.
In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm.
arXiv Detail & Related papers (2024-08-15T12:11:20Z) - Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation [6.479933058008389]
Style-Extracting Diffusion Models generate images with unseen characteristics beneficial for downstream tasks.
In this work, we show the capability of our method on a natural image dataset as a proof-of-concept.
We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients.
arXiv Detail & Related papers (2024-03-21T14:36:59Z) - Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z) - Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z) - Mitigating the Effect of Incidental Correlations on Part-based Learning [50.682498099720114]
Part-based representations could be more interpretable and generalize better with limited data.
We present two innovative regularization methods for part-based representations.
We achieve state-of-the-art (SoTA) performance on few-shot learning tasks on benchmark datasets.
arXiv Detail & Related papers (2023-09-30T13:44:48Z) - Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis [68.1281982092765]
We propose a novel normalization module, termed REtrieval-based Spatially AdaptIve normaLization (RESAIL).
RESAIL provides pixel level fine-grained guidance to the normalization architecture.
Experiments on several challenging datasets show that our RESAIL performs favorably against state-of-the-arts in terms of quantitative metrics, visual quality, and subjective evaluation.
arXiv Detail & Related papers (2022-04-06T14:21:39Z) - StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
Lacking compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z) - Learning Intrinsic Images for Clothing [10.21096394185778]
In this paper, we focus on intrinsic image decomposition for clothing images.
A more interpretable edge-aware metric and an annotation scheme are designed for the testing set.
We show that our proposed model significantly reduces texture-copying artifacts while retaining surprisingly fine details.
arXiv Detail & Related papers (2021-11-16T14:43:12Z) - Region-adaptive Texture Enhancement for Detailed Person Image Synthesis [86.69934638569815]
RATE-Net is a novel framework for synthesizing person images with sharp texture details.
The proposed framework leverages an additional texture enhancing module to extract appearance information from the source image.
Experiments conducted on DeepFashion benchmark dataset have demonstrated the superiority of our framework compared with existing networks.
arXiv Detail & Related papers (2020-05-26T02:33:21Z) - TailorGAN: Making User-Defined Fashion Designs [28.805686791183618]
We propose a novel self-supervised model to synthesize garment images with disentangled attributes without paired data.
Our method consists of a reconstruction learning step and an adversarial learning step.
Experiments on this dataset and real-world samples demonstrate that our method can synthesize much better results than the state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T16:54:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.