iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image
Diffusion Model for Interior Design
- URL: http://arxiv.org/abs/2312.04326v2
- Date: Tue, 19 Dec 2023 06:50:21 GMT
- Title: iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image
Diffusion Model for Interior Design
- Authors: Ruyi Gan, Xiaojun Wu, Junyu Lu, Yuanhe Tian, Dixiang Zhang, Ziwei Wu,
Renliang Sun, Chang Liu, Jiaxing Zhang, Pingjian Zhang, Yan Song
- Abstract summary: We propose a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach.
The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach.
- Score: 42.061819736162356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the open-sourcing of text-to-image models (T2I) such as stable diffusion
(SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned
in specific domains based on the open-source SD model, such as in anime,
character portraits, etc. However, specialized models are scarce in certain
domains such as interior design, a gap attributed to the complex textual
descriptions and detailed visual elements inherent in design, alongside the
necessity for adaptable resolution. Therefore, text-to-image models for
interior design are required to have outstanding prompt-following capabilities,
as well as iterative collaboration with design professionals to achieve the
desired outcome. In this paper, we collect and optimize text-image data in the
design field and continue training the open-source CLIP model on both English
and Chinese data. We also propose a fine-tuning strategy with curriculum
learning and reinforcement learning from CLIP feedback to enhance the
prompt-following capability of our approach and thereby improve the quality of
image generation. Experimental results on the collected dataset demonstrate
the effectiveness of the proposed approach, which outperforms strong
baselines.
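The abstract names reinforcement learning from CLIP feedback but gives no implementation details here. As a minimal sketch of what the reward signal can look like, the snippet below scores a generated image against its prompt with an off-the-shelf CLIP checkpoint and returns the cosine similarity as a scalar reward; the checkpoint name and reward definition are illustrative assumptions, not the authors' bilingual CLIP or released code.

```python
# Minimal sketch of a CLIP-feedback reward -- NOT the authors' code.
# Uses the stock openai/clip-vit-base-patch32 checkpoint for illustration;
# iDesigner instead continues training CLIP on English/Chinese design data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings,
    usable as the scalar reward in RL fine-tuning."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()
```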
Related papers
- Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that, compared with existing image-text matching models, the optimized model achieves significantly better performance on a series of benchmark datasets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z)
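The summary above stays at the architecture level. A common concrete form of such a matcher is a dual encoder: separate image and text backbones projected into a shared space and trained with a symmetric contrastive loss over the pairwise score matrix. The sketch below illustrates that generic pattern; the backbones, dimensions, and temperature are placeholders, not the paper's specific architecture.

```python
# Generic dual-encoder image-text matcher, illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderMatcher(nn.Module):
    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 img_dim: int, txt_dim: int, shared_dim: int = 256):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a CNN/ViT feature extractor
        self.text_backbone = text_backbone    # e.g. a transformer text encoder
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, images, texts):
        img = F.normalize(self.img_proj(self.image_backbone(images)), dim=-1)
        txt = F.normalize(self.txt_proj(self.text_backbone(texts)), dim=-1)
        return img @ txt.T  # (batch_i, batch_t) matching-score matrix

def contrastive_matching_loss(scores: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over the score matrix; matched pairs sit on the diagonal."""
    targets = torch.arange(scores.size(0), device=scores.device)
    logits = scores / temperature
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```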
- Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding [9.787025432074978]
This paper introduces the Prompt Optimizer for Abstract Concepts (POAC) to enhance the performance of text-to-image diffusion models.
We propose a Prompt Language Model (PLM), which is adapted from a pre-trained language model and then fine-tuned on a dataset of abstract-concept prompts.
Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between images generated by a stable diffusion model and the optimized prompts.
arXiv Detail & Related papers (2024-04-17T17:38:56Z)
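The POAC summary names a prompt language model trained with RL against an image-prompt alignment reward. One standard way to wire those pieces together is REINFORCE: sample a rewritten prompt from the PLM, render it with a frozen diffusion model, and weight the rewrite's log-likelihood gradient by the alignment score. The sketch below shows that loop; `plm`, `render_image`, and `alignment_score` are stand-in components, and nothing here comes from the paper's code.

```python
# REINFORCE-style prompt optimization, illustrative only. `plm` is a
# Hugging Face causal LM, `render_image(prompt)` a frozen diffusion
# sampler, and `alignment_score(image, prompt)` a CLIP-style reward --
# all stand-ins, not components released by the paper.
import torch

def reinforce_step(plm, tokenizer, optimizer, raw_prompt,
                   render_image, alignment_score, baseline=0.0):
    enc = tokenizer(raw_prompt, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():  # sampling itself needs no gradient
        seq = plm.generate(**enc, do_sample=True, max_new_tokens=48)
    # Recompute log-probs with a differentiable forward pass.
    logits = plm(input_ids=seq).logits[:, :-1, :]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(2, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
    rewrite_logp = token_logp[:, prompt_len - 1:].sum()  # sampled rewrite only
    optimized = tokenizer.decode(seq[0, prompt_len:], skip_special_tokens=True)
    reward = alignment_score(render_image(optimized), optimized)
    loss = -(reward - baseline) * rewrite_logp  # REINFORCE with a baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return optimized, reward
```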
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how key design choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
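The UniDiff summary lists three objectives (ITC, IS, RSC) without saying how they are combined; the usual recipe for such unified models is a weighted sum of the per-task losses on each batch. The step below sketches that composition with the three losses passed in as callables and the weights chosen arbitrarily; none of these values come from the paper.

```python
# One training step over a combined multi-task objective, in the spirit of
# L = w_itc * L_ITC + w_is * L_IS + w_rsc * L_RSC. The loss callables and
# weights are placeholders, not UniDiff's reported configuration.
import torch

def unidiff_step(model, batch, optimizer,
                 itc_loss, synthesis_loss, consistency_loss,
                 w_itc=1.0, w_is=1.0, w_rsc=0.5):
    loss = (w_itc * itc_loss(model, batch)          # image-text contrastive
            + w_is * synthesis_loss(model, batch)   # text-conditioned synthesis
            + w_rsc * consistency_loss(model, batch))  # reciprocal consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```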
- A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called generative-model inference that is capable of enhancing pre-trained GANs effectively and seamlessly.
Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
arXiv Detail & Related papers (2021-12-07T05:22:50Z)
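The entry above describes inferring an optimal latent distribution for a pre-trained GAN via Wasserstein gradient flow. A heavily simplified relative of that idea is plain gradient ascent on a single latent code against a requirement score, which the sketch below implements; the paper's actual method operates on distributions and is not reproduced here.

```python
# Bare-bones latent optimization for a frozen, pre-trained generator --
# a heavy simplification of the paper's Wasserstein-gradient-flow inference.
import torch

def optimize_latent(generator, score_fn, latent_dim=512, steps=200, lr=0.05):
    """Find z that maximizes score_fn(generator(z)), where score_fn is a
    stand-in for a realism or requirement score."""
    generator.requires_grad_(False)  # only the latent code is optimized
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-score_fn(generator(z))).backward()  # ascend score via its negative
        opt.step()
    return z.detach()
```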
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep-learning-based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.