M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing
- URL: http://arxiv.org/abs/2205.11705v1
- Date: Tue, 24 May 2022 01:18:14 GMT
- Title: M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing
- Authors: Zhikang Li, Huiling Zhou, Shuai Bai, Peike Li, Chang Zhou, Hongxia Yang
- Abstract summary: We incorporate style prior knowledge and flexible multi-modal control into one unified two-stage framework, M6-Fashion, focused on practical AI-aided fashion design.
M6-Fashion uses self-correction in non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls.
- Score: 51.033376763225675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fashion industry has diverse applications for multi-modal image generation
and editing, which aim to create a desired high-fidelity image guided by
multi-modal conditional signals. Most existing methods either introduce extra
models to learn different condition guidance controls or ignore style prior
knowledge, which makes it difficult to handle multiple signal combinations and
leads to low-fidelity results. In this paper, we incorporate both style prior
knowledge and flexible multi-modal control into one unified two-stage
framework, M6-Fashion, focused on practical AI-aided fashion design. It
decouples style codes in both the spatial and semantic dimensions to guarantee
high-fidelity image generation in the first stage. M6-Fashion uses
self-correction in non-autoregressive generation to improve inference speed,
enhance holistic consistency, and support various signal controls. Extensive
experiments on a large-scale clothing dataset, M2C-Fashion, demonstrate
superior performance on various image generation and editing tasks. The
M6-Fashion model serves as a highly promising AI designer for the fashion
industry.
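The abstract does not spell out the decoding procedure, so the following is only a generic sketch of iterative non-autoregressive decoding with re-masking, not the authors' algorithm. It may help show why such schemes need far fewer steps than one-token-at-a-time generation and how earlier predictions can be revised. The names `toy_predictor` and `iterative_decode`, the `MASK` sentinel, and the linear re-masking schedule are all assumptions made for illustration; the random predictor merely stands in for a trained model.

```python
import random

MASK = -1  # hypothetical sentinel for a position that is not yet decided


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) pair per position. A real model would
    condition on `tokens` and on the control signals; this toy ignores them.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(length, vocab_size, steps, predictor=toy_predictor, seed=0):
    """Decode `length` tokens in `steps` parallel passes (steps << length)."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Predict every position in parallel, including already-filled ones,
        # so that earlier choices can be overwritten (the "correction" part).
        preds = predictor(tokens, vocab_size, rng)
        tokens = [tok for tok, _ in preds]
        # The number of re-masked positions shrinks linearly to zero.
        n_mask = length * (steps - step) // steps
        # Re-mask the least confident positions for the next pass.
        worst = sorted(range(length), key=lambda i: preds[i][1])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    # 16 tokens are produced in 4 passes instead of 16 sequential steps.
    print(iterative_decode(length=16, vocab_size=8, steps=4))
```

In this sketch the pass count is fixed in advance, which is where the speed advantage over autoregressive decoding would come from; the confidence-based re-masking is one common way to let later passes revise earlier ones.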
Related papers
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts [48.214475133206385]
EMMA is a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA.
By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts.
arXiv Detail & Related papers (2024-06-13T14:26:43Z)
- TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing [23.51498634405422]
We present an innovative image editing framework that employs the robust Chain-of-Thought reasoning and localizing capabilities of multimodal large language models.
Our model exhibits an enhanced ability to understand complex prompts and generate corresponding images, while maintaining high fidelity and consistency in images before and after generation.
arXiv Detail & Related papers (2024-05-27T03:50:37Z)
- MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation [70.83668869857665]
MMTryon is a multi-modal multi-reference VIrtual Try-ON framework.
It can generate high-quality compositional try-on results by taking a text instruction and multiple garment images as inputs.
arXiv Detail & Related papers (2024-05-01T11:04:22Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
- Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
This paper tackles the task of multimodal-conditioned fashion image editing.
Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures.
arXiv Detail & Related papers (2024-03-21T20:43:10Z)
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.