iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image
Diffusion Model for Interior Design
- URL: http://arxiv.org/abs/2312.04326v2
- Date: Tue, 19 Dec 2023 06:50:21 GMT
- Title: iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image
Diffusion Model for Interior Design
- Authors: Ruyi Gan, Xiaojun Wu, Junyu Lu, Yuanhe Tian, Dixiang Zhang, Ziwei Wu,
Renliang Sun, Chang Liu, Jiaxing Zhang, Pingjian Zhang, Yan Song
- Abstract summary: We propose a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach.
The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach.
- Score: 42.061819736162356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the open-sourcing of text-to-image (T2I) models such as Stable Diffusion
(SD) and Stable Diffusion XL (SD-XL), there has been an influx of models fine-tuned
on the open-source SD model for specific domains such as anime and character
portraits. However, few specialized models exist in domains such as interior
design, owing to the complex textual descriptions and detailed visual elements
inherent in design, along with the need for adaptable resolution. Text-to-image
models for interior design therefore require outstanding prompt-following
capabilities, as well as iterative collaboration with design professionals to
achieve the desired outcome. In this paper, we collect and optimize text-image
data in the design field and continue training the open-source CLIP model in
both English and Chinese. We also propose a fine-tuning strategy with
curriculum learning and reinforcement learning from CLIP feedback to enhance
the prompt-following capabilities of our approach and thereby improve the
quality of image generation. Experimental results on the collected dataset
demonstrate the effectiveness of the proposed approach, which achieves
impressive results and outperforms strong baselines.
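A minimal sketch, not taken from the paper, of the CLIP-feedback reward idea described in the abstract: candidate images generated for a prompt are scored by CLIP image-text similarity, and those scores could then rank candidates or weight a policy-gradient update during fine-tuning. The checkpoint name, reward definition, and ranking step are illustrative assumptions; the paper's bilingual (English/Chinese) CLIP would stand in for the English-only model used here.

```python
# Illustrative sketch of a CLIP-based reward for RL-from-CLIP-feedback fine-tuning.
# Assumptions: an off-the-shelf OpenAI CLIP checkpoint (the paper continues training
# its own bilingual CLIP), and reward = cosine similarity between prompt and image.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(prompt: str, images) -> torch.Tensor:
    """Return one cosine-similarity reward per candidate image for the given prompt."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # Normalize the projected text and image embeddings before taking the dot product.
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)

# Usage (hypothetical): sample several candidates from a fine-tuned SD/SD-XL pipeline
# for an interior-design prompt, then use the rewards to pick the best candidate or to
# weight a policy-gradient update of the diffusion model.
# rewards = clip_reward("a minimalist Scandinavian living room with warm oak flooring", images)
```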
Related papers
- Training-Free Sketch-Guided Diffusion with Latent Optimization [22.94468603089249]
We propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition.
To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models.
We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process.
arXiv Detail & Related papers (2024-08-31T00:44:03Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding [9.787025432074978]
This paper introduces the Prompt Optimizer for Abstract Concepts (POAC) to enhance the performance of text-to-image diffusion models.
We propose a Prompt Language Model (PLM), which is curated from a pre-trained language model, and then fine-tuned with a dataset of abstract concept prompts.
Our framework employs a Reinforcement Learning (RL)-based optimization strategy that focuses on the alignment between images generated by a stable diffusion model and the optimized prompts.
arXiv Detail & Related papers (2024-04-17T17:38:56Z)
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep-learning-based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)