Related papers: M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing

URL: http://arxiv.org/abs/2205.11705v1
Date: Tue, 24 May 2022 01:18:14 GMT
Title: M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing
Authors: Zhikang Li, Huiling Zhou, Shuai Bai, Peike Li, Chang Zhou, Hongxia Yang
Abstract summary: We adapt style prior knowledge and flexibility of multi-modal control into one unified two-stage framework, M6-Fashion, focusing on the practical AI-aided Fashion design. M6-Fashion utilizes self-correction for the non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls.
Score: 51.033376763225675
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The fashion industry has diverse applications in multi-modal image generation and editing. It aims to create a desired high-fidelity image with the multi-modal conditional signal as guidance. Most existing methods learn different condition guidance controls by introducing extra models or ignoring the style prior knowledge, which makes it difficult to handle multiple signal combinations and leads to a low-fidelity problem. In this paper, we adapt both style prior knowledge and flexibility of multi-modal control into one unified two-stage framework, M6-Fashion, focusing on the practical AI-aided Fashion design. It decouples style codes in both spatial and semantic dimensions to guarantee high-fidelity image generation in the first stage. M6-Fashion utilizes self-correction for the non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls. Extensive experiments on a large-scale clothing dataset M2C-Fashion demonstrate superior performances on various image generation and editing tasks. The M6-Fashion model serves as a highly promising AI designer for the fashion industry.

Related papers

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.1967962502411]
We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features. A sequential pretraining strategy for unified models (first training on image understanding, then on image generation) offers practical advantages. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
arXiv Detail & Related papers (2025-05-14T17:11:07Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting [39.50293003775675]
We propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM). The proposed framework aims at customizing a fashion model wearing a given garment via text prompts and facial images.
arXiv Detail & Related papers (2025-03-03T08:30:37Z)
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation [29.489516715874306]
We present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks.
arXiv Detail & Related papers (2024-08-21T03:17:20Z)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs). MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions. Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts [48.214475133206385]
EMMA is a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts.
arXiv Detail & Related papers (2024-06-13T14:26:43Z)
MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation [70.83668869857665]
MMTryon is a multi-modal multi-reference VIrtual Try-ON framework. It can generate high-quality compositional try-on results by taking a text instruction and multiple garment images as inputs.
arXiv Detail & Related papers (2024-05-01T11:04:22Z)
MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images. We tackle this problem by proposing a new architecture based on latent diffusion models. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z)
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce. We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.