FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion
- URL: http://arxiv.org/abs/2301.02110v1
- Date: Thu, 5 Jan 2023 15:33:23 GMT
- Title: FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion
- Authors: Martin Pernuš, Clinton Fookes, Vitomir Štruc, Simon Dobrišek
- Abstract summary: We propose a novel text-conditioned editing model, called FICE, capable of handling a wide variety of diverse text descriptions.
FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.
- Score: 16.583537785874604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fashion-image editing represents a challenging computer vision task, where
the goal is to incorporate selected apparel into a given input image. Most
existing techniques, known as Virtual Try-On methods, deal with this task by
first selecting an example image of the desired apparel and then transferring
the clothing onto the target person. Conversely, in this paper, we consider
editing fashion images with text descriptions. Such an approach has several
advantages over example-based virtual try-on techniques, e.g.: (i) it does not
require an image of the target fashion item, and (ii) it allows the expression
of a wide variety of visual concepts through the use of natural language.
Existing image-editing methods that work with language inputs are heavily
constrained by their requirement for training sets with rich attribute
annotations or they are only able to handle simple text descriptions. We
address these constraints by proposing a novel text-conditioned editing model,
called FICE (Fashion Image CLIP Editing), capable of handling a wide variety of
diverse text descriptions to guide the editing procedure. Specifically, with
FICE, we augment the common GAN inversion process by including semantic,
pose-related, and image-level constraints when generating images. We leverage
the capabilities of the CLIP model to enforce the semantics, due to its
impressive image-text association capabilities. We furthermore propose a
latent-code regularization technique that provides the means to better control
the fidelity of the synthesized images. We validate FICE through rigorous
experiments on a combination of VITON images and Fashion-Gen text descriptions
and in comparison with several state-of-the-art text-conditioned image editing
approaches. Experimental results demonstrate FICE generates highly realistic
fashion images and leads to stronger editing performance than existing
competing approaches.
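
The abstract describes the guided inversion only at a high level, so the following is a minimal sketch of what CLIP-guided GAN inversion with semantic, pose-related, and image-level constraints plus latent-code regularization could look like. The `generator`, `pose_estimator`, the encoder-based latent initialization, and all loss weights are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of CLIP-guided GAN inversion with semantic, pose-related and
# image-level constraints plus latent-code regularization, in the spirit of
# the procedure described above. `generator`, `pose_estimator`, the encoder
# initialization and all loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_loss(image, text_tokens):
    # Resize to CLIP's input resolution; proper CLIP preprocessing is omitted here.
    image_224 = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
    image_feat = F.normalize(clip_model.encode_image(image_224), dim=-1)
    text_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    return 1.0 - (image_feat * text_feat).sum(dim=-1).mean()

def guided_inversion(generator, source_image, text, pose_estimator,
                     steps=200, lr=0.05, lam_pose=1.0, lam_img=1.0, lam_reg=0.1):
    text_tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        target_pose = pose_estimator(source_image)        # hypothetical keypoint tensor
    w = generator.initial_latent(source_image).clone()    # hypothetical encoder-based init
    w_init = w.detach().clone()
    w.requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        image = generator(w)                                        # candidate edit
        loss = clip_loss(image, text_tokens)                        # semantic (text) constraint
        loss = loss + lam_pose * F.mse_loss(pose_estimator(image), target_pose)  # pose constraint
        loss = loss + lam_img * F.l1_loss(image, source_image)      # image-level fidelity
        loss = loss + lam_reg * (w - w_init).pow(2).mean()          # latent-code regularization
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(w).detach()
```

In this reading, the regularization term simply keeps the optimized latent close to its initialization, one plausible way to trade editing strength against fidelity to the synthesized image.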
Related papers
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet users' specific requirements.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- Visual Instruction Inversion: Image Editing via Visual Prompting [34.96778567507126]
We present a method for image editing via visual prompting.
We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions.
arXiv Detail & Related papers (2023-07-26T17:50:10Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- TD-GEM: Text-Driven Garment Editing Mapper [15.121103742607383]
We propose a Text-Driven Garment Editing Mapper (TD-GEM) to edit fashion items in a disentangled way.
Optimization-based Contrastive Language-Image Pre-training (CLIP) guidance is then used to steer the latent representation of a fashion image.
Our TD-GEM manipulates the image accurately according to the target attribute expressed in terms of a text prompt.
arXiv Detail & Related papers (2023-05-29T14:31:54Z)
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z)
- AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation [61.77946020543875]
We propose a framework for translating raw descriptions with complex semantics into semantically corresponding images.
Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN.
Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training.
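
As a rough illustration of the two-component design described above (not the paper's actual architecture), a prompt-based projection from CLIP text embeddings toward CLIP image embeddings can be sketched as a small residual MLP whose output conditions a StyleGAN-style generator. The layer sizes, learned prompt vector, and generator interface are assumptions.

```python
# Hypothetical sketch of a prompt-based text-to-image-embedding projector;
# dimensions, prompt handling and the generator interface are assumptions.
import torch
import torch.nn as nn

class TextToImageEmbeddingProjector(nn.Module):
    """Maps a CLIP text embedding (plus a learned prompt vector) toward the
    CLIP image-embedding space assumed to condition the generator."""
    def __init__(self, dim=512, prompt_dim=512, hidden=1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_dim))   # learned prompt vector
        self.net = nn.Sequential(
            nn.Linear(dim + prompt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        prompt = self.prompt.expand(text_emb.size(0), -1)
        # Residual connection keeps the projection close to the text embedding.
        return text_emb + self.net(torch.cat([text_emb, prompt], dim=-1))

# Usage idea (assumed interfaces):
#   image_emb = projector(clip_text_features)
#   image = embedding_conditioned_generator(image_emb)
```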
arXiv Detail & Related papers (2022-09-07T13:53:54Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
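
Read literally, the two steps above suggest a simple optimization loop: form a target point in CLIP space from the image and text embeddings, then iteratively update the image toward it under regularization. The sketch below is a heavily simplified pixel-space version; the mixing weight `lam`, the L2 regularizer, and pixel-space optimization are assumptions rather than FlexIT's actual latent-space procedure.

```python
# Simplified sketch: optimize an image toward a mixed image/text target point
# in CLIP embedding space, with an L2 regularizer toward the source image.
# `lam`, `reg` and pixel-space optimization are assumptions for illustration.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def edit(image, text, lam=0.5, steps=100, lr=0.02, reg=0.3):
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        img224 = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
        target = F.normalize(
            lam * F.normalize(model.encode_text(tokens), dim=-1)
            + (1 - lam) * F.normalize(model.encode_image(img224), dim=-1), dim=-1)
    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        feat = F.normalize(model.encode_image(
            F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)), dim=-1)
        # Cosine distance to the target point plus fidelity to the source image.
        loss = 1.0 - (feat * target).sum(-1).mean() + reg * F.mse_loss(x, image)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```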
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- HairCLIP: Design Your Hair by Text and Reference Image [100.85116679883724]
This paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly.
We encode the image and text conditions in a shared embedding space and propose a unified hair editing framework.
With the carefully designed network structures and loss functions, our framework can perform high-quality hair editing.
arXiv Detail & Related papers (2021-12-09T18:59:58Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
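
The latent mapper mentioned above can be thought of as a small network trained to output a text-specific latent step. The sketch below is an assumed, simplified single-branch version (StyleCLIP's actual mapper is more elaborate, for example operating on separate latent groups); the generator `G`, the CLIP-based loss, and the step scale are placeholders.

```python
# Sketch of a text-specific latent mapper: given a latent code w, predict a
# small step delta so that G(w + scale * delta) matches the text under a
# CLIP-based loss, while an L2 term keeps the edit small. `G`,
# `clip_text_loss` and the step scale are assumptions for illustration.
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, latent_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, w):
        return self.net(w)

def train_mapper(mapper, G, clip_text_loss, latent_batches, step_scale=0.1,
                 lam_l2=0.8, epochs=10, lr=5e-4):
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        for w in latent_batches:               # batches of latent codes
            delta = mapper(w)
            image = G(w + step_scale * delta)  # edited image from the generator
            loss = clip_text_loss(image) + lam_l2 * delta.pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return mapper
```

Once trained for a given prompt, the mapper produces an edit in a single forward pass, which is the "faster and more stable" alternative to per-image latent optimization described in the summary.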
arXiv Detail & Related papers (2021-03-31T17:51:25Z)