M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing
- URL: http://arxiv.org/abs/2205.11705v1
- Date: Tue, 24 May 2022 01:18:14 GMT
- Title: M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing
- Authors: Zhikang Li, Huiling Zhou, Shuai Bai, Peike Li, Chang Zhou, Hongxia Yang
- Abstract summary: We adapt both style prior knowledge and the flexibility of multi-modal control into one unified two-stage framework, M6-Fashion, focusing on practical AI-aided fashion design.
M6-Fashion utilizes self-correction for the non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls.
- Score: 51.033376763225675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fashion industry has diverse applications in multi-modal image generation
and editing. It aims to create a desired high-fidelity image with the
multi-modal conditional signal as guidance. Most existing methods learn
different condition guidance controls by introducing extra models or by
ignoring style prior knowledge, which makes it difficult to handle multiple
signal combinations and leads to low-fidelity results. In this paper, we
adapt both style prior knowledge and the flexibility of multi-modal control
into one unified two-stage framework, M6-Fashion, focusing on practical
AI-aided fashion design. In the first stage, it decouples style codes in both
spatial and semantic dimensions to guarantee high-fidelity image generation.
M6-Fashion
utilizes self-correction for the non-autoregressive generation to improve
inference speed, enhance holistic consistency, and support various signal
controls. Extensive experiments on the large-scale clothing dataset
M2C-Fashion demonstrate superior performance on various image generation and
editing tasks. The M6-Fashion model shows strong potential as an AI designer
for the fashion industry.
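The abstract does not spell out the second stage, but "self-correction for the non-autoregressive generation" suggests an iterative, parallel mask-and-repredict decoding loop over the style codes. The following is a minimal, hypothetical Python sketch of that idea; StylePredictor, the codebook size, the 16x16 token grid, and the confidence-based re-masking schedule are illustrative assumptions, not the paper's released architecture.

```python
# Hypothetical sketch of iterative non-autoregressive decoding with
# self-correction, in the spirit of the two-stage setup described above.
# Names and sizes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, GRID = 1024, 1024, 16 * 16  # codebook size, mask token id, 16x16 token grid


class StylePredictor(nn.Module):
    """Toy bidirectional transformer that predicts style-code tokens in parallel."""

    def __init__(self, dim=256, cond_dim=256):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 1, dim)          # +1 for the [MASK] token
        self.pos = nn.Parameter(torch.zeros(GRID, dim))
        self.cond = nn.Linear(cond_dim, dim)             # multi-modal condition (text/sketch/style)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens, cond):
        h = self.tok(tokens) + self.pos + self.cond(cond).unsqueeze(1)
        return self.head(self.encoder(h))                # (B, GRID, VOCAB) logits


@torch.no_grad()
def generate(model, cond, steps=8):
    """Start fully masked; each step re-predicts every token in parallel and
    keeps only the most confident ones (self-correction re-masks the rest)."""
    B = cond.shape[0]
    tokens = torch.full((B, GRID), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        keep = int(GRID * (step + 1) / steps)             # reveal more tokens each step
        thresh = conf.sort(dim=-1, descending=True).values[:, keep - 1: keep]
        tokens = torch.where(conf >= thresh, pred, torch.full_like(pred, MASK_ID))
    return tokens                                         # style-code indices for a decoder


model = StylePredictor()
codes = generate(model, cond=torch.randn(1, 256))
print(codes.shape)  # torch.Size([1, 256])
```

Because every pass re-predicts all tokens in parallel and re-masks the least confident ones, earlier mistakes can be revised in later passes, which matches the stated goals of fast inference and holistic consistency.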
Related papers
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation [29.489516715874306]
We present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain.
Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks.
arXiv Detail & Related papers (2024-08-21T03:17:20Z)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts [48.214475133206385]
EMMA is a novel image generation model that accepts multi-modal prompts, built upon the state-of-the-art text-to-image (T2I) diffusion model ELLA.
By freezing all parameters in the original T2I diffusion model and adjusting only some additional layers, we find that the pre-trained T2I diffusion model can secretly accept multi-modal prompts (a minimal sketch of this freeze-and-adapt setup follows this entry).
arXiv Detail & Related papers (2024-06-13T14:26:43Z)
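The freeze-and-adapt recipe summarized for EMMA above is a common pattern: keep every weight of the pre-trained T2I backbone fixed and optimize only a few added layers that inject the extra modalities. Below is a minimal PyTorch sketch of that pattern under stated assumptions; GatedAdapter, the zero-initialized gate, and the stand-in backbone are hypothetical, not EMMA's actual modules.

```python
# Minimal sketch of the "freeze the T2I backbone, train only added layers"
# pattern summarized for EMMA above. FrozenT2IBlock / GatedAdapter are
# hypothetical stand-ins, not EMMA's real modules.
import torch
import torch.nn as nn


class GatedAdapter(nn.Module):
    """Small trainable layer that injects a multi-modal prompt into frozen features."""

    def __init__(self, dim=320, prompt_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, kdim=prompt_dim,
                                          vdim=prompt_dim, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts at 0: training begins from the frozen behavior

    def forward(self, x, prompt):
        out, _ = self.attn(x, prompt, prompt)
        return x + torch.tanh(self.gate) * out


backbone = nn.TransformerEncoder(                       # stand-in for the frozen T2I network
    nn.TransformerEncoderLayer(320, nhead=4, batch_first=True), num_layers=2)
adapter = GatedAdapter()

for p in backbone.parameters():                         # freeze every backbone weight
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)  # only adapter weights are updated

x = torch.randn(2, 64, 320)                             # latent features from the frozen model
prompt = torch.randn(2, 8, 768)                         # extra multi-modal prompt tokens
loss = adapter(backbone(x), prompt).pow(2).mean()       # placeholder objective
loss.backward()
optimizer.step()
```

Only the adapter parameters are passed to the optimizer, so the generative prior of the backbone stays untouched while the added layers learn to route the new modalities.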
- MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation [70.83668869857665]
MMTryon is a multi-modal multi-reference VIrtual Try-ON framework.
It can generate high-quality compositional try-on results by taking a text instruction and multiple garment images as inputs.
arXiv Detail & Related papers (2024-05-01T11:04:22Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
The CLS embeddings are used both to augment the text embeddings and, together with the patch embeddings, to derive a small number of detail-rich subject embeddings (a sketch of this embedding flow follows this entry).
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
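The MM-Diff summary above describes a concrete embedding flow: a vision encoder yields a CLS embedding plus patch embeddings, the CLS embedding augments the text embeddings, and CLS plus patches are distilled into a handful of subject embeddings. Here is a hypothetical PyTorch sketch of such a flow; SubjectEmbedder, the learnable-query resampler, and all dimensions are assumptions for illustration, not MM-Diff's actual design.

```python
# Hypothetical sketch of the embedding flow summarized for MM-Diff above:
# a vision encoder yields CLS + patch embeddings, CLS augments the text
# embeddings, and a few learnable queries distill CLS + patches into a
# small set of subject embeddings. Module names and sizes are assumptions.
import torch
import torch.nn as nn


class SubjectEmbedder(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, num_subject_tokens=4):
        super().__init__()
        self.cls_to_text = nn.Linear(vis_dim, txt_dim)            # CLS -> text embedding space
        self.queries = nn.Parameter(torch.randn(num_subject_tokens, vis_dim))
        self.resampler = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)

    def forward(self, cls_emb, patch_emb, text_emb):
        # 1) augment the text embeddings with the (projected) CLS embedding
        text_aug = torch.cat([text_emb, self.cls_to_text(cls_emb).unsqueeze(1)], dim=1)
        # 2) distill CLS + patch embeddings into a few detail-rich subject embeddings
        vis = torch.cat([cls_emb.unsqueeze(1), patch_emb], dim=1)
        q = self.queries.unsqueeze(0).expand(cls_emb.shape[0], -1, -1)
        subject_emb, _ = self.resampler(q, vis, vis)
        return text_aug, subject_emb


B = 2
cls_emb = torch.randn(B, 768)         # CLS embedding from a vision encoder (e.g. a ViT)
patch_emb = torch.randn(B, 256, 768)  # patch embeddings from the same encoder
text_emb = torch.randn(B, 77, 768)    # text encoder output
embedder = SubjectEmbedder()
text_aug, subject_emb = embedder(cls_emb, patch_emb, text_emb)
print(text_aug.shape, subject_emb.shape)  # torch.Size([2, 78, 768]) torch.Size([2, 4, 768])
```

A few query tokens cross-attending over CLS plus patches is one common way to compress fine-grained visual detail into a compact set of subject embeddings; the actual number and wiring in MM-Diff may differ.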
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks (an illustrative triplet objective is sketched after this entry).
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
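FaD-VLP's summary mentions adding triplet-based tasks to standard multimodal pre-training but does not describe how the weakly-supervised triplets are mined. As a rough illustration only, the sketch below shows a standard triplet margin objective over image and text embeddings, with other items in the batch standing in for negatives; this in-batch construction is an assumption, not the paper's procedure.

```python
# Illustrative sketch of a triplet-style objective such as the one FaD-VLP
# adds to standard multimodal pre-training. How the weakly-supervised
# triplets are actually mined from fashion image-text pairs is not specified
# in the summary above; here the negatives are simply other items in the batch.
import torch
import torch.nn.functional as F


def triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb[i] and txt_emb[i] describe the same item (positive pair);
    txt_emb rolled by one position provides in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    pos = (img_emb * txt_emb).sum(-1)                  # anchor-positive similarity
    neg = (img_emb * txt_emb.roll(1, dims=0)).sum(-1)  # anchor-negative similarity
    return F.relu(margin - pos + neg).mean()


loss = triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```

In practice the triplets would come from the paper's weakly-supervised mining over fashion image-text pairs rather than from simple batch rolling.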
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.