Fashion Matrix: Editing Photos by Just Talking
- URL: http://arxiv.org/abs/2307.13240v1
- Date: Tue, 25 Jul 2023 04:06:25 GMT
- Title: Fashion Matrix: Editing Photos by Just Talking
- Authors: Zheng Chong, Xujie Zhang, Fuwei Zhao, Zhenyu Xie and Xiaodan Liang
- Abstract summary: We develop a hierarchical AI system called Fashion Matrix dedicated to editing photos by just talking.
Fashion Matrix employs Large Language Models (LLMs) as its foundational support and engages in iterative interactions with users.
Visual Foundation Models are leveraged to generate edited images from text prompts and masks, thereby facilitating the automation of fashion editing processes.
- Score: 66.83502497764698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The utilization of Large Language Models (LLMs) for the construction of AI
systems has garnered significant attention across diverse fields. The extension
of LLMs to the domain of fashion holds substantial commercial potential but
also poses inherent challenges due to the intricate semantic interactions in
fashion-related generation. To address this issue, we developed a hierarchical
AI system called Fashion Matrix dedicated to editing photos by just talking.
This system facilitates diverse prompt-driven tasks, encompassing garment or
accessory replacement, recoloring, addition, and removal. Specifically, Fashion
Matrix employs LLMs as its foundational support and engages in iterative
interactions with users. It employs a range of Semantic Segmentation Models
(e.g., Grounded-SAM, MattingAnything, etc.) to delineate the specific editing
masks based on user instructions. Subsequently, Visual Foundation Models (e.g.,
Stable Diffusion, ControlNet, etc.) are leveraged to generate edited images
from text prompts and masks, thereby facilitating the automation of fashion
editing processes. Experiments demonstrate the outstanding ability of Fashion
Matrix to explore the collaborative potential of functionally diverse
pre-trained models in the domain of fashion editing.
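For concreteness, below is a minimal Python sketch of the mask-then-inpaint flow the abstract describes. The `parse_instruction` and `segment_region` helpers are hypothetical placeholders for the LLM and segmentation (e.g., Grounded-SAM) stages, whose interfaces the abstract does not specify; the final stage uses the Hugging Face diffusers inpainting pipeline as a stand-in for the paper's Visual Foundation Models, not the paper's actual implementation.

```python
# Minimal sketch of the talk-to-edit flow (assumptions, not the paper's code).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline


def parse_instruction(instruction: str) -> tuple[str, str]:
    """Hypothetical stand-in for the LLM stage: map a user request to
    (region to edit, text prompt for the edited content)."""
    # e.g. "replace the jacket with a red leather jacket"
    #   -> region "jacket", prompt "a red leather jacket"
    region, _, prompt = instruction.partition(" with ")
    return region.removeprefix("replace the").strip(), (prompt.strip() or instruction)


def segment_region(image: Image.Image, region: str) -> Image.Image:
    """Hypothetical stand-in for the segmentation stage (Grounded-SAM /
    MattingAnything in the paper). Should return a binary mask image in which
    white marks the area to be regenerated."""
    raise NotImplementedError("plug an open-vocabulary segmenter in here")


def edit_photo(image_path: str, instruction: str) -> Image.Image:
    image = Image.open(image_path).convert("RGB").resize((512, 512))
    region, prompt = parse_instruction(instruction)
    mask = segment_region(image, region)

    # Visual Foundation Model stage: a standard text+mask inpainting pipeline.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```

In the full system, the LLM also drives iterative multi-turn refinement with the user and additional conditioning via ControlNet, which this sketch omits.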
Related papers
- DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing [26.090574235851083]
We introduce a new fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit)
DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images.
To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism.
arXiv Detail & Related papers (2024-09-02T09:15:26Z)
- AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion [25.61572702219732]
Fashion image editing aims to modify a person's appearance based on a given instruction.
Current methods require auxiliary tools like segmenters and keypoint extractors, lacking a flexible and unified framework.
We propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas.
arXiv Detail & Related papers (2024-08-21T12:04:32Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE)
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
- Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z)
- FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion [16.583537785874604]
We propose a novel text-conditioned editing model, called FICE, capable of handling a wide variety of diverse text descriptions.
FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.
arXiv Detail & Related papers (2023-01-05T15:33:23Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
- SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing [94.31103255204933]
We propose a unified model for open-domain image editing focusing on color and tone adjustment of open-domain images.
Our model learns a unified editing space that is more semantic, intuitive, and easy to manipulate.
We show that by inverting image pairs into latent codes of the learned editing space, our model can be leveraged for various downstream editing tasks.
arXiv Detail & Related papers (2021-11-30T23:53:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.