Saliency Guided Optimization of Diffusion Latents
- URL: http://arxiv.org/abs/2410.10257v1
- Date: Mon, 14 Oct 2024 08:12:42 GMT
- Title: Saliency Guided Optimization of Diffusion Latents
- Authors: Xiwen Wang, Jizhe Zhou, Xuekang Zhu, Cheng Li, Mao Li
- Abstract summary: The key challenge in text-to-image generation is optimizing a model's outputs so that they better align with human intentions or prompts.
Existing methods overlook the fact that, when viewing an image, the human visual system naturally prioritizes attention toward salient areas, often neglecting less salient or non-salient regions.
We propose Saliency Guided Optimization Of Diffusion Latents (SGOOL) to address this alignment challenge effectively and efficiently.
- Score: 9.237421522280819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advances in diffusion models, generating decent images from text prompts is no longer challenging. The key to text-to-image generation is how to optimize the results of a text-to-image model so that they better align with human intentions or prompts. Existing optimization methods commonly treat the entire image uniformly and conduct global optimization. These methods overlook the fact that, when viewing an image, the human visual system naturally prioritizes attention toward salient areas, often neglecting less salient or non-salient regions. That is, humans are likely to neglect optimizations in non-salient areas. Consequently, although model retraining is conducted under the guidance of additional large and multimodal models, existing methods, which perform uniform optimizations, yield sub-optimal results. To address this alignment challenge effectively and efficiently, we propose Saliency Guided Optimization Of Diffusion Latents (SGOOL). We first employ a saliency detector to mimic the human visual attention system and mark out the salient regions. To avoid retraining an additional model, our method directly optimizes the diffusion latents. In addition, SGOOL uses an invertible diffusion process, which gives it the merits of a constant-memory implementation. Hence, our method is a parameter-efficient and plug-and-play fine-tuning method. Extensive experiments were conducted with several metrics and human evaluation; the results demonstrate the superiority of SGOOL in image quality and prompt alignment.
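As a rough, runnable illustration of the core mechanism (not the authors' implementation), the sketch below optimizes latents under a saliency-weighted prompt loss; `decode`, `saliency`, and `prompt_score` are toy stand-ins for the diffusion sampler, the saliency detector, and a prompt-alignment objective such as a CLIP score.

```python
# A minimal sketch of saliency-weighted latent optimization, assuming toy
# stand-ins: `decode` for the (invertible) diffusion sampler, `saliency`
# for the saliency detector, `prompt_score` for a prompt-alignment loss.
import torch

def decode(latents):
    # Stand-in sampler: maps latents to an "image" in [0, 1].
    return torch.sigmoid(latents)

def saliency(image):
    # Stand-in detector: marks bright pixels as "salient".
    return (image > 0.5).float()

def prompt_score(image):
    # Stand-in per-pixel alignment loss (e.g., a CLIP-style objective).
    return (image - 1.0) ** 2

latents = torch.randn(3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([latents], lr=0.05)

for _ in range(50):
    image = decode(latents)
    weights = saliency(image).detach()             # don't differentiate the detector
    loss = (weights * prompt_score(image)).mean()  # salient pixels dominate the update
    opt.zero_grad()
    loss.backward()      # with an invertible sampler, this backward pass
    opt.step()           # can be run with constant memory
```

The one property the sketch preserves is that the gradient is re-weighted by saliency before each latent update, so salient regions dominate the optimization; in SGOOL, backpropagating through an invertible sampler is what makes the memory cost constant.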
Related papers
- Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs [30.973473583364832]
DoSSR is a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models.
At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models.
Our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps.
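A hedged sketch of the domain-shift idea, assuming (our simplification, not the paper's exact equation) that the reverse process is initialized from a blend of the low-resolution input and noise rather than pure noise, which is why only a few sampling steps remain:

```python
# Hedged sketch: initialize reverse diffusion from a blend of the
# low-resolution input and noise instead of pure noise. The linear
# "shift" schedule and all names here are our assumptions.
import torch

def shifted_start(lr_image, t_frac):
    """Diffusion state at fraction t_frac of the chain, biased toward lr_image."""
    noise = torch.randn_like(lr_image)
    shift = 1.0 - t_frac                   # assumed: later start -> more image content
    return shift * lr_image + (1.0 - shift) * noise

lr = torch.rand(3, 64, 64)                 # upsampled low-resolution input
x_start = shifted_start(lr, t_frac=0.2)    # start most of the way through the chain,
                                           # so only a handful of denoising steps remain
```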
arXiv Detail & Related papers (2024-09-26T12:16:11Z)
- OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also optimizes significantly faster than existing methods.
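A minimal illustration of explicit orientation conditioning, under our own assumption (not the paper's architecture) that camera azimuth and elevation are sinusoidally encoded and fused into the text condition:

```python
# Illustrative sketch: camera azimuth/elevation are sinusoidally encoded
# and added to the text condition before the denoiser sees it.
import math
import torch

def orientation_embedding(azimuth, elevation, dim=8):
    """Sinusoidal features of the camera angles (in radians)."""
    freqs = 2.0 ** torch.arange(dim // 4)            # geometric frequency ladder
    angles = torch.tensor([azimuth, elevation])[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()

text_cond = torch.randn(8)                           # stand-in text embedding
cam_cond = orientation_embedding(math.pi / 4, 0.1)   # a 45-degree azimuth view
cond = text_cond + cam_cond                          # fused condition for the denoiser
```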
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
- OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model [6.83367289911244]
Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks.
Most existing super-resolution methods for ODIs use end-to-end learning strategies, resulting in inferior realism in the generated images.
arXiv Detail & Related papers (2024-04-16T06:39:37Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
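One plausible form of such a self-play objective, sketched with toy stand-ins (the logistic margin loss and one-layer denoiser are our assumptions, not SPIN-Diffusion's exact formulation): the current model is trained to fit real images better than images sampled from its previous checkpoint, with no reward model.

```python
# Toy sketch of a self-play margin objective.
import torch
import torch.nn.functional as F

def denoise_loss(model, x):
    """Per-sample noise-prediction error (stand-in diffusion loss)."""
    noise = torch.randn_like(x)
    return ((model(x + noise) - noise) ** 2).mean(dim=(1, 2, 3))

model = torch.nn.Conv2d(3, 3, 3, padding=1)       # toy denoiser
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

real = torch.rand(4, 3, 32, 32)                   # real training images
synth = torch.rand(4, 3, 32, 32)                  # samples from the previous checkpoint

margin = denoise_loss(model, synth) - denoise_loss(model, real)
loss = F.softplus(-margin).mean()                 # reward fitting real over self-generated
opt.zero_grad()
loss.backward()
opt.step()
```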
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction [60.52716381465063]
We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images.
Specifically, we propose a deceptive diffusion model that turns noisy images rendered from few-view reconstructions into high-quality pseudo-observations.
Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times.
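A schematic of that progressive loop, with `render` and `diffusion_enhance` as hypothetical stand-ins for the few-view renderer and the deceptive diffusion model:

```python
# Schematic of the pseudo-observation loop; all components are stand-ins.
import torch

def render(views):
    # Stand-in: novel views rendered from a sparse reconstruction are noisy.
    return [v + 0.1 * torch.randn_like(v) for v in views]

def diffusion_enhance(image):
    # Stand-in: the diffusion model turns a noisy render into a clean pseudo-observation.
    return image.clamp(0.0, 1.0)

train_set = [torch.rand(3, 64, 64) for _ in range(3)]   # sparse input views

for _ in range(3):
    pseudo = [diffusion_enhance(img) for img in render(train_set)]
    train_set.extend(pseudo)       # progressively densify the training set
    # ... retrain / refine the NeRF or 3DGS model on `train_set` here ...

print(len(train_set))              # 3 -> 24, roughly the 5-10x densification regime
```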
arXiv Detail & Related papers (2023-05-24T14:00:32Z)
- Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models [18.026820439151404]
A key issue currently limiting the application of diffusion models is their extremely slow generation process.
We propose a novel method called Optimal Linear Subspace Search (OLSS).
OLSS is able to generate high-quality images with a very small number of steps.
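The core numerical idea can be pictured as a least-squares fit: each fast-scheduler step is approximated as a linear combination of the latents computed so far, fitted against a slow reference trajectory. The random trajectory below is an illustrative assumption, not OLSS's actual search procedure:

```python
# Least-squares sketch of the linear-subspace idea.
import torch

teacher = torch.randn(100, 16)             # latents from a 100-step reference run
keep = [0, 24, 49, 74, 99]                 # the few steps the fast scheduler visits

basis = teacher[keep[:-1]].T               # (16, 4): latents available so far
target = teacher[keep[-1]].unsqueeze(1)    # (16, 1): latent the fast step should hit

weights = torch.linalg.lstsq(basis, target).solution   # optimal combination
print(torch.norm(basis @ weights - target))            # approximation error
```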
arXiv Detail & Related papers (2023-05-24T03:33:30Z)
- High-Fidelity Clothed Avatar Reconstruction from a Single Image [73.15939963381906]
We propose a coarse-to-fine way to realize a high-fidelity clothed avatar reconstruction from a single image.
We first use an implicit model to learn the person's general shape in the canonical space.
We then refine the surface detail by optimizing a non-rigid deformation in the posed space.
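A toy sketch of that coarse-to-fine split, with every component a stand-in: a small implicit network plays the canonical-space shape model, and only a free-form deformation field is optimized in the fine stage.

```python
# Toy sketch of the coarse-to-fine split; all components are stand-ins.
import torch

coarse = torch.nn.Sequential(              # stand-in canonical implicit model (SDF)
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
for p in coarse.parameters():
    p.requires_grad_(False)                # the coarse model stays frozen

points = torch.rand(256, 3)                           # posed-space samples
offsets = torch.zeros(256, 3, requires_grad=True)     # non-rigid deformation field
opt = torch.optim.Adam([offsets], lr=1e-2)

for _ in range(100):
    sdf = coarse(points + offsets)         # warp to canonical space, query the shape
    loss = sdf.abs().mean() + 0.1 * offsets.norm(dim=-1).mean()  # fit + small warp
    opt.zero_grad()
    loss.backward()
    opt.step()
```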
arXiv Detail & Related papers (2023-04-08T04:01:04Z)
- A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called *generative-model inference* that is capable of enhancing pre-trained GANs effectively and seamlessly.
Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
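One simplified way to picture this (our assumption, not the paper's derivation) is regularized latent descent with a small noise term, a Langevin-style discretization of a gradient flow that keeps latents near the generator's prior:

```python
# Regularized latent descent with a diffusion term; toy stand-ins throughout.
import torch

G = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh())  # stand-in generator

def task_loss(image):
    # Stand-in "given requirements", e.g., matching a target statistic.
    return ((image - 0.5) ** 2).mean()

z = torch.randn(8, 16, requires_grad=True)
step = 0.05
for _ in range(100):
    loss = task_loss(G(z)) + 0.01 * (z ** 2).mean()   # stay near the N(0, I) prior
    grad, = torch.autograd.grad(loss, z)
    with torch.no_grad():
        z -= step * grad                              # deterministic flow step
        z += (2 * step * 0.01) ** 0.5 * torch.randn_like(z)  # small diffusion term
```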
arXiv Detail & Related papers (2021-12-07T05:22:50Z)