ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal
Fashion Design
- URL: http://arxiv.org/abs/2208.05621v1
- Date: Thu, 11 Aug 2022 03:44:02 GMT
- Title: ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal
Fashion Design
- Authors: Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie,
Chengwen Huang, Jianqing Peng, Xiaodan Liang
- Abstract summary: Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain.
MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information.
ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage.
- Score: 66.68194916359309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal fashion image synthesis has emerged as one of the most promising
directions in the generation domain due to the vast untapped potential of
incorporating multiple modalities and the wide range of fashion image
applications. To facilitate accurate generation, cross-modal synthesis methods
typically rely on Contrastive Language-Image Pre-training (CLIP) to align
textual and garment information. In this work, we argue that simply aligning
textual and garment information is not sufficient to capture the semantics of
the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the
garments into semantic parts, ensuring fine-grained and semantically accurate
alignment between the visual and text information. Building on MaskCLIP, we
propose ARMANI, a unified cross-modal fashion designer with part-level
garment-text alignment. ARMANI discretizes an image into uniform tokens based
on a learned cross-modal codebook in its first stage and uses a Transformer to
model the distribution of image tokens for a real image given the tokens of the
control signals in its second stage. Contrary to prior approaches that also
rely on two-stage paradigms, ARMANI introduces textual tokens into the
codebook, making it possible for the model to utilize fine-grained semantic
information to generate more realistic images. Further, by introducing a
cross-modal Transformer, ARMANI is versatile and can accomplish image synthesis
from various control signals, such as pure text, sketch images, and partial
images. Extensive experiments conducted on our newly collected cross-modal
fashion dataset demonstrate that ARMANI generates photo-realistic images in
diverse synthesis tasks and outperforms existing state-of-the-art cross-modal
image synthesis approaches. Our code is available at
https://github.com/Harvey594/ARMANI.
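As a rough illustration of the second-stage idea described in the abstract (not ARMANI's actual implementation), the sketch below models the distribution of discrete image tokens given control-signal tokens with an autoregressive Transformer over a shared cross-modal codebook. The class name, layer sizes, and the stage-one quantizer that would produce the token indices are all assumptions.

```python
# Minimal sketch of a second-stage cross-modal prior (illustrative, not ARMANI's code).
import torch
import torch.nn as nn

class CrossModalPrior(nn.Module):
    """Autoregressive Transformer over [control tokens ; image tokens].

    Stage one is assumed to have produced a shared codebook in which both
    quantized image patches and textual (sub)words are discrete indices.
    Stage two, sketched here, predicts image tokens given the control-signal
    tokens (text, sketch, or partial image) and previously generated tokens.
    """
    def __init__(self, vocab_size=16384, dim=512, depth=8, heads=8, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)   # indices into the shared cross-modal codebook
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, control_tokens, image_tokens):
        # Concatenate control-signal tokens and (shifted) image tokens into one sequence.
        seq = torch.cat([control_tokens, image_tokens], dim=1)
        pos = torch.arange(seq.size(1), device=seq.device)
        x = self.tok_emb(seq) + self.pos_emb(pos)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.ones(seq.size(1), seq.size(1), device=seq.device), 1).bool()
        x = self.blocks(x, mask=mask)
        return self.head(x)   # logits over the shared codebook

# Usage: cross-entropy would be applied on the image-token positions only.
model = CrossModalPrior()
ctrl = torch.randint(0, 16384, (2, 64))    # e.g. text / sketch tokens (placeholder values)
img = torch.randint(0, 16384, (2, 256))    # quantized image tokens (placeholder values)
logits = model(ctrl, img)
```

At inference time, image tokens would be sampled autoregressively from these logits and decoded back to pixels by the stage-one decoder, which is not shown here.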
Related papers
- IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis [8.080248399002663]
In this paper, semantic image synthesis is treated as an image denoising task.
The style reference is first contaminated with random noise and then progressively denoised by IIDM.
Three techniques (refinement, color-transfer, and model ensembles) are proposed to further boost the generation quality.
arXiv Detail & Related papers (2024-03-20T08:21:00Z)
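To make the denoising description in the IIDM entry above concrete, here is a generic DDIM-style reverse loop in which the style reference is noised and then progressively denoised under a semantic-map condition. The `denoiser` network, the noise schedule, and the conditioning interface are placeholders, not IIDM's actual components.

```python
# Hedged sketch of "noise the style reference, then progressively denoise it".
import torch

def progressive_denoise(style_ref, semantic_map, denoiser, alphas_cumprod):
    """Simplified deterministic (DDIM-like) reverse loop; variance terms omitted."""
    T = len(alphas_cumprod)
    # 1) Contaminate the style reference with random noise at the final timestep.
    noise = torch.randn_like(style_ref)
    x = alphas_cumprod[-1].sqrt() * style_ref + (1 - alphas_cumprod[-1]).sqrt() * noise
    # 2) Progressively denoise, conditioned on the semantic layout.
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, semantic_map, t)               # predicted noise (placeholder network)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # estimate of the clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x

# Example call (all arguments hypothetical):
# result = progressive_denoise(style, layout, my_unet, torch.linspace(0.999, 0.01, 50))
```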
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
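The triplet-based objective mentioned in the FaD-VLP entry above can be illustrated with a generic image-text triplet margin loss. The in-batch hard-negative mining used below is a common weak-supervision proxy and not necessarily how FaD-VLP constructs its triplets.

```python
# Generic illustration of triplet-based image-text alignment (not FaD-VLP's code).
import torch
import torch.nn.functional as F

def image_text_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs.

    Each image is the anchor, its own caption the positive, and the hardest
    other caption in the batch the negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag()                            # matched pairs
    diag = torch.eye(len(sim), dtype=torch.bool)
    neg = sim.masked_fill(diag, float("-inf")).max(dim=1).values
    return F.relu(margin - pos + neg).mean()

loss = image_text_triplet_loss(torch.randn(8, 256), torch.randn(8, 256))
```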
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
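The multi-view objective in the ERNIE-ViL 2.0 entry above can be pictured as an InfoNCE loss applied over every pair of views, e.g. image, caption, and an object-tag sequence treated as an extra textual view. The encoders and view choices below are illustrative assumptions, not ERNIE-ViL 2.0's architecture.

```python
# Hedged sketch of multi-view contrastive learning over inter- and intra-modal view pairs.
import itertools
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two (B, D) batches of view embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(len(a))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_view_loss(views):
    """views: list of (B, D) embeddings, e.g. [image, caption, object_tags]."""
    pairs = list(itertools.combinations(views, 2))   # every inter- and intra-modal pair
    return sum(info_nce(a, b) for a, b in pairs) / len(pairs)

B, D = 16, 256
loss = multi_view_loss([torch.randn(B, D) for _ in range(3)])  # placeholder view embeddings
```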
- AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation [61.77946020543875]
We propose a framework for translating raw descriptions with complex semantics into semantically corresponding images.
Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN.
Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training.
arXiv Detail & Related papers (2022-09-07T13:53:54Z)
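A bare-bones version of the projection component in the AI Illustrator entry above might look like the following: a small MLP maps text embeddings toward the paired image-embedding space, whose output a StyleGAN-like generator would then consume. All names and the MSE training signal here are assumptions for illustration.

```python
# Illustrative text-to-image-embedding projector (not the paper's prompt-based module).
import torch
import torch.nn as nn

class TextToImageEmbeddingProjector(nn.Module):
    """Maps a text embedding (e.g. from CLIP) toward the image-embedding space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, text_emb):
        return self.net(text_emb)

# Assumed training signal: make the projected text embedding match the paired image
# embedding, so the generator can be driven from text alone at test time.
projector = TextToImageEmbeddingProjector()
text_emb = torch.randn(4, 512)    # placeholder for text features
image_emb = torch.randn(4, 512)   # placeholder for image features
loss = nn.functional.mse_loss(projector(text_emb), image_emb)
# At inference: image = stylegan_generator(projector(text_emb))  # generator not shown
```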
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
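A toy version of the modality-transition step in the entry above is sketched below: region-level visual features are projected into a semantic space and pulled toward the caption's sentence embedding by an auxiliary loss. The module layout and the loss form are assumptions, not the paper's actual MTM or modality loss.

```python
# Illustrative stand-in for a modality-transition module before a caption decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(visual_dim, semantic_dim), nn.Tanh())

    def forward(self, visual_feats):
        # (B, N, visual_dim) region features -> (B, N, semantic_dim) "semantic" features
        return self.proj(visual_feats)

mtm = ModalityTransition()
regions = torch.randn(2, 36, 2048)        # placeholder detector region features
semantic = mtm(regions)                   # would be fed to the caption decoder (not shown)
caption_emb = torch.randn(2, 768)         # placeholder sentence embedding of the reference caption
modality_loss = F.mse_loss(semantic.mean(dim=1), caption_emb)   # assumed form of the auxiliary loss
```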
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)