Iterative Object Count Optimization for Text-to-image Diffusion Models
- URL: http://arxiv.org/abs/2408.11721v1
- Date: Wed, 21 Aug 2024 15:51:46 GMT
- Title: Iterative Object Count Optimization for Text-to-image Diffusion Models
- Authors: Oz Zafar, Lior Wolf, Idan Schwartz
- Abstract summary: Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
- Score: 59.03672816121209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
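The loop the abstract describes (optimize a "counting token" so a counting model's estimate on the generated image matches the requested count) can be illustrated with a toy sketch. Everything below is a hypothetical scalar stand-in, not the paper's models: `generate` and `count` are invented surrogates, and the gradient is written out by hand for this surrogate only.

```python
import numpy as np

def generate(token):
    # Hypothetical "generator": maps the counting-token value to an image
    # statistic whose implied object count grows with the token (stand-in
    # for the diffusion model conditioned on the text embedding).
    return 3.0 * token + 1.0

def count(image_stat):
    # Hypothetical differentiable counting model (identity surrogate).
    return image_stat

def optimize_counting_token(target_count, steps=200, lr=0.01):
    # Iteratively update the token to minimize the squared counting loss
    # (count(generate(token)) - target_count)^2, mirroring the idea of
    # optimizing the text-conditioning embedding against a counting loss.
    token = 0.0
    for _ in range(steps):
        estimated = count(generate(token))
        # Chain rule for this toy: d loss / d token = 2*(err) * 3.0
        grad = 2.0 * (estimated - target_count) * 3.0
        token -= lr * grad
    return token

token = optimize_counting_token(target_count=10.0)
print(round(count(generate(token))))  # prints 10
```

As in the paper's advantage (iii), the optimized `token` can be reused: once it converges, every call to `generate(token)` yields the target count without re-running the optimization.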
Related papers
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects [31.909039527164403]
Control of the number of objects depicted using text is surprisingly hard.
Generating object-correct counts is challenging because the generative model needs to keep a sense of separate identity for every instance of the object.
We show how CountGen can be used to guide denoising with correct object count.
arXiv Detail & Related papers (2024-06-14T17:46:08Z)
- Semantic Generative Augmentations for Few-Shot Counting [0.0]
We investigate how synthetic data can benefit few-shot class-agnostic counting.
We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map.
Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent, well-performing few-shot counting models.
arXiv Detail & Related papers (2023-10-26T11:42:48Z)
- Counting Guidance for High Fidelity Text-to-Image Synthesis [2.6212127510234797]
Text-to-image diffusion models often fail to generate high-fidelity content with respect to the input prompt.
E.g. given a prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects.
We propose a method to improve diffusion models to focus on producing the correct object count.
arXiv Detail & Related papers (2023-06-30T11:40:35Z)
- PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object to image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, lightweight image-editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, with performance surpassing diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
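The cluster-then-aggregate idea in this summary can be sketched as follows. This is a minimal illustrative version, not ClusTR's actual clustering: keys are grouped with a naive k-means, values are mean-pooled per cluster, and dense attention then runs over the few centroids instead of all tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(x, k, iters=10, seed=0):
    # Naive k-means: assign each token to its nearest centroid, then
    # recompute centroids as cluster means.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(0)
    return centroids

def clustered_attention(q, k, v, num_clusters=4):
    # Cluster keys; aggregate values by the same assignment so each
    # centroid carries a pooled value token.
    ck = kmeans(k, num_clusters)
    d = ((k[:, None, :] - ck[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    cv = np.stack([v[labels == j].mean(0) if (labels == j).any()
                   else np.zeros(v.shape[1]) for j in range(num_clusters)])
    # Dense attention over num_clusters tokens instead of len(k) tokens.
    attn = softmax(q @ ck.T / np.sqrt(q.shape[-1]))
    return attn @ cv

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((64, 16))
v = rng.standard_normal((64, 16))
out = clustered_attention(q, k, v)
print(out.shape)  # (8, 16)
```

The cost saving comes from the attention matrix shrinking from 8x64 to 8x4; the clustering step preserves content diversity because each centroid summarizes a group of semantically similar tokens.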
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Dynamic Proposals for Efficient Object Detection [48.66093789652899]
We propose a simple yet effective method which is adaptive to different computational resources by generating dynamic proposals for object detection.
Our method achieves significant speed-up across a wide range of detection models including two-stage and query-based models.
arXiv Detail & Related papers (2022-07-12T01:32:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.