Iterative Object Count Optimization for Text-to-image Diffusion Models
- URL: http://arxiv.org/abs/2408.11721v1
- Date: Wed, 21 Aug 2024 15:51:46 GMT
- Title: Iterative Object Count Optimization for Text-to-image Diffusion Models
- Authors: Oz Zafar, Lior Wolf, Idan Schwartz
- Abstract summary: Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
- Score: 59.03672816121209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
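The loop the abstract describes (optimize a "counting token" so a counting model's estimate on the generated image matches the requested count) can be illustrated with a toy sketch. Everything below is a hypothetical scalar stand-in, not the paper's models: `generate` and `count` are invented surrogates, and the gradient is written out by hand for this surrogate only.

```python
import numpy as np

def generate(token):
    # Hypothetical "generator": maps the counting-token value to an image
    # statistic whose implied object count grows with the token (stand-in
    # for the diffusion model conditioned on the text embedding).
    return 3.0 * token + 1.0

def count(image_stat):
    # Hypothetical differentiable counting model (identity surrogate).
    return image_stat

def optimize_counting_token(target_count, steps=200, lr=0.01):
    # Iteratively update the token to minimize the squared counting loss
    # (count(generate(token)) - target_count)^2, mirroring the idea of
    # optimizing the text-conditioning embedding against a counting loss.
    token = 0.0
    for _ in range(steps):
        estimated = count(generate(token))
        # Chain rule for this toy: d loss / d token = 2*(err) * 3.0
        grad = 2.0 * (estimated - target_count) * 3.0
        token -= lr * grad
    return token

token = optimize_counting_token(target_count=10.0)
print(round(count(generate(token))))  # prints 10
```

As in the paper's advantage (iii), the optimized `token` can be reused: once it converges, every call to `generate(token)` yields the target count without re-running the optimization.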
Related papers
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects [31.909039527164403]
Control of the number of objects depicted using text is surprisingly hard.
Generating object-correct counts is challenging because the generative model needs to keep a sense of separate identity for every instance of the object.
We show how CountGen can be used to guide denoising with correct object count.
arXiv Detail & Related papers (2024-06-14T17:46:08Z)
- Semantic Generative Augmentations for Few-Shot Counting [0.0]
We investigate how synthetic data can benefit few-shot class-agnostic counting.
We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map.
Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent, well-performing few-shot counting models.
arXiv Detail & Related papers (2023-10-26T11:42:48Z)
- Counting Guidance for High Fidelity Text-to-Image Synthesis [2.6212127510234797]
Text-to-image diffusion models often fail to generate high-fidelity content with respect to the input prompt.
E.g. given a prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects.
We propose a method to improve diffusion models to focus on producing the correct object count.
arXiv Detail & Related papers (2023-06-30T11:40:35Z)
- PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object to image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, lightweight image-editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, with performance surpassing diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
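The cluster-then-aggregate idea in this summary can be sketched as follows. This is a minimal illustrative version, not ClusTR's actual clustering: keys are grouped with a naive k-means, values are mean-pooled per cluster, and dense attention then runs over the few centroids instead of all tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(x, k, iters=10, seed=0):
    # Naive k-means: assign each token to its nearest centroid, then
    # recompute centroids as cluster means.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(0)
    return centroids

def clustered_attention(q, k, v, num_clusters=4):
    # Cluster keys; aggregate values by the same assignment so each
    # centroid carries a pooled value token.
    ck = kmeans(k, num_clusters)
    d = ((k[:, None, :] - ck[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    cv = np.stack([v[labels == j].mean(0) if (labels == j).any()
                   else np.zeros(v.shape[1]) for j in range(num_clusters)])
    # Dense attention over num_clusters tokens instead of len(k) tokens.
    attn = softmax(q @ ck.T / np.sqrt(q.shape[-1]))
    return attn @ cv

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((64, 16))
v = rng.standard_normal((64, 16))
out = clustered_attention(q, k, v)
print(out.shape)  # (8, 16)
```

The cost saving comes from the attention matrix shrinking from 8x64 to 8x4; the clustering step preserves content diversity because each centroid summarizes a group of semantically similar tokens.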
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Dynamic Proposals for Efficient Object Detection [48.66093789652899]
We propose a simple yet effective method which is adaptive to different computational resources by generating dynamic proposals for object detection.
Our method achieves significant speed-up across a wide range of detection models including two-stage and query-based models.
arXiv Detail & Related papers (2022-07-12T01:32:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.