Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2408.11721v2
- Date: Thu, 05 Jun 2025 15:25:55 GMT
- Title: Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
- Authors: Oz Zafar, Yuval Cohen, Lior Wolf, Idan Schwartz
- Abstract summary: We propose a new framework that uses pre-trained object counting techniques and object detectors to guide generation. First, we optimize a counting token using an outer-loop loss computed on fully generated images. Second, we introduce a detection-driven scaling term that corrects errors caused by viewpoint and proportion shifts.
- Score: 54.641726517633025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately controlling object count in text-to-image generation remains a key challenge. Supervised methods often fail, as training data rarely covers all count variations. Methods that manipulate the denoising process to add or remove objects can help; however, they still require labeled data, limit robustness and image quality, and rely on a slow, iterative process. Pre-trained differentiable counting models that rely on soft object density summation exist and could steer generation, but employing them presents three main challenges: (i) they are pre-trained on clean images, making them less effective during denoising steps that operate on noisy inputs; (ii) they are not robust to viewpoint changes; and (iii) optimization is computationally expensive, requiring repeated model evaluations per image. We propose a new framework that uses pre-trained object counting techniques and object detectors to guide generation. First, we optimize a counting token using an outer-loop loss computed on fully generated images. Second, we introduce a detection-driven scaling term that corrects errors caused by viewpoint and proportion shifts, among other factors, without requiring backpropagation through the detection model. Third, we show that the optimized parameters can be reused for new prompts, removing the need for repeated optimization. Our method provides efficiency through token reuse, flexibility via compatibility with various detectors, and accuracy with improved counting across diverse object categories.
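The outer-loop procedure described in the abstract (generate a full image, run a detector on it, and correct the counting token from the signed count error, without backpropagating through the detector) can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: `generate_image` and `detect_count` are hypothetical stand-ins for the diffusion model and the pre-trained detector, and the counting token is reduced to a single scale parameter.

```python
import random

def generate_image(token_scale, prompt_count, rng):
    # Hypothetical stand-in for the diffusion model: the number of
    # objects it renders drifts with the counting-token scale.
    noise = rng.gauss(0.0, 0.3)
    return max(0, round(prompt_count * token_scale + noise))

def detect_count(image):
    # Stand-in for a pre-trained detector run on the fully generated
    # (clean) image; here the "image" is just its object count.
    return image

def optimize_counting_token(target, steps=200, lr=0.05, seed=0):
    """Outer-loop optimization: generate, detect, correct.

    The detector only supplies a signed count error, so no
    backpropagation through the detection model is required.
    """
    rng = random.Random(seed)
    scale = 1.5  # initial counting-token scale (assumed starting point)
    for _ in range(steps):
        image = generate_image(scale, target, rng)
        error = detect_count(image) - target
        scale -= lr * error / max(target, 1)
    return scale

# In this toy setup the scale converges toward 1.0, where the
# generated count matches the target.
scale = optimize_counting_token(target=5)
```

In the actual method the optimized counting token plays the role of `scale` and, as the abstract notes, can be reused for new prompts, avoiding repeated optimization per image.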
Related papers
- Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting [1.1871535995163365]
Textual Inversion (TI) allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as few as three examples. The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluate whether the method matches or outperforms baseline methods that suffer from forgetting in a variety of quantitative and qualitative experiments.
arXiv Detail & Related papers (2025-08-07T12:28:08Z) - RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z) - QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain [40.661699970360736]
We tackle the problem of quantifying the number of objects generated by a text-to-image model.
Rather than retraining such a model for each new image domain of interest, we are the first to consider this problem from a domain-agnostic perspective.
We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining.
arXiv Detail & Related papers (2024-11-29T08:20:12Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
We propose an algorithm that enables fast and high-quality generation under arbitrary constraints. During inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image. Our approach produces results that rival or surpass the state-of-the-art training-free inference approaches.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has been conventionally believed to be a challenging property to encode for neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Make It Count: Text-to-Image Generation with an Accurate Number of Objects [31.909039527164403]
Control of the number of objects depicted using text is surprisingly hard.
Generating object-correct counts is challenging because the generative model needs to keep a sense of separate identity for every instance of the object.
We show how CountGen can be used to guide denoising with correct object count.
arXiv Detail & Related papers (2024-06-14T17:46:08Z) - AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies. AdaDiff is optimized using a policy method to maximize a carefully designed reward function. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z) - Semantic Generative Augmentations for Few-Shot Counting [0.0]
We investigate how synthetic data can benefit few-shot class-agnostic counting.
We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map.
Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models.
arXiv Detail & Related papers (2023-10-26T11:42:48Z) - Reducing False Alarms in Video Surveillance by Deep Feature Statistical Modeling [16.311150636417256]
We develop a weakly supervised a-contrario validation process based on high-dimensional statistical modeling of deep features.
Experimental results reveal that the proposed a-contrario validation substantially reduces the number of false alarms at both pixel and object levels.
arXiv Detail & Related papers (2023-07-09T12:37:17Z) - Counting Guidance for High Fidelity Text-to-Image Synthesis [2.6212127510234797]
Text-to-image diffusion models often fail to generate high-fidelity content with respect to the input prompt. For example, given the prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects.
We propose a method to improve diffusion models to focus on producing the correct object count.
arXiv Detail & Related papers (2023-06-30T11:40:35Z) - PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object to image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z) - Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z) - Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Dynamic Proposals for Efficient Object Detection [48.66093789652899]
We propose a simple yet effective method which is adaptive to different computational resources by generating dynamic proposals for object detection.
Our method achieves significant speed-up across a wide range of detection models including two-stage and query-based models.
arXiv Detail & Related papers (2022-07-12T01:32:50Z) - Tackling the Background Bias in Sparse Object Detection via Cropped Windows [17.547911599819837]
We propose a simple tiling method that improves the detection capability in the remote sensing case without modifying the model itself.
The procedure was validated on three different data sets and outperformed similar approaches in both accuracy and speed.
arXiv Detail & Related papers (2021-06-04T06:59:56Z) - Powers of layers for image-to-image translation [60.5529622990682]
We propose a simple architecture to address unpaired image-to-image translation tasks.
We start from an image autoencoder architecture with fixed weights.
For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached.
arXiv Detail & Related papers (2020-08-13T09:02:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.