D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
- URL: http://arxiv.org/abs/2510.19278v1
- Date: Wed, 22 Oct 2025 06:27:05 GMT
- Title: D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
- Authors: Nobline Yoo, Olga Russakovsky, Ye Zhu
- Abstract summary: Detector-to-Differentiable (D2D) is a novel framework that transforms non-differentiable detection models into differentiable critics. Our experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD demonstrate consistent and substantial improvements in object counting accuracy.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
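The core idea in the abstract, converting per-detection logits into soft binary indicators whose sum is a differentiable count, can be sketched as follows. This is an illustrative approximation only: the paper's custom activation functions, thresholds, and the noise-prior optimization loop are not specified here, so the temperature-scaled sigmoid, the `tau`/`threshold` parameters, and the example logits are all assumptions.

```python
import numpy as np

def soft_count(logits, threshold=0.0, tau=0.5):
    """Differentiable surrogate for count-via-enumeration.

    Each detection logit is mapped to a soft binary indicator in (0, 1)
    with a temperature-scaled sigmoid; summing the indicators gives a
    differentiable approximation of the hard detection count.
    (Sketch only; the paper's custom activation may differ.)
    """
    indicators = 1.0 / (1.0 + np.exp(-(logits - threshold) / tau))
    return indicators.sum()

# Hypothetical detector logits for candidate boxes in one generated image.
logits = np.array([4.2, 3.8, -2.0, 5.1, -3.5])

# Hard (non-differentiable) count: detections clearing the threshold.
hard = int((logits > 0.0).sum())

# Soft count closely tracks the hard count at low temperature.
soft = soft_count(logits)

# A counting loss against the target count parsed from the prompt
# (e.g. "three apples"); its gradient can flow back through the
# indicators toward the noise prior in an optimization loop.
target = 3
loss = (soft - target) ** 2
```

In practice the gradient of such a loss would be backpropagated through the detector and the T2I model to update the initial noise at inference time, rather than the model weights.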
Related papers
- $\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection [85.9202830503973]
Visual autoregressive (AR) models generate images through discrete token prediction. We propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection.
arXiv Detail & Related papers (2025-10-07T13:02:27Z) - One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models [65.96186414865747]
Text-to-Image (T2I) diffusion models face a trade-off between inference speed and image quality. We introduce TiUE, the first time-independent unified encoder for the student UNet architecture. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling.
arXiv Detail & Related papers (2025-05-28T04:23:22Z) - D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, discrete-valued tokens representing coarse-grained image features are sampled by a small discrete-valued generator. In the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z) - Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help [18.70937620674227]
We introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our evaluations reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.
arXiv Detail & Related papers (2025-03-10T03:28:18Z) - Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models [54.641726517633025]
We propose a new framework that uses pre-trained object counting techniques and object detectors to guide generation. First, we optimize a counting token using an outer-loop loss computed on fully generated images. Second, we introduce a detection-driven scaling term that corrects errors caused by viewpoint and proportion shifts.
arXiv Detail & Related papers (2024-08-21T15:51:46Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Beyond the Benchmark: Detecting Diverse Anomalies in Videos [0.6993026261767287]
Video Anomaly Detection (VAD) plays a crucial role in modern surveillance systems, aiming to identify various anomalies in real-world situations.
Current benchmark datasets predominantly emphasize simple, single-frame anomalies such as novel object detection.
We advocate for an expansion of VAD investigations to encompass intricate anomalies that extend beyond conventional benchmark boundaries.
arXiv Detail & Related papers (2023-10-03T09:22:06Z) - A Dual Attentive Generative Adversarial Network for Remote Sensing Image Change Detection [6.906936669510404]
We propose a dual attentive generative adversarial network for achieving very high-resolution remote sensing image change detection tasks.
The DAGAN framework has better performance with 85.01% mean IoU and 91.48% mean F1 score than advanced methods on the LEVIR dataset.
arXiv Detail & Related papers (2023-10-03T08:26:27Z) - Noise-Tolerant Few-Shot Unsupervised Adapter for Vision-Language Models [8.59772105902647]
We design NtUA, a Noise-tolerant Unsupervised Adapter that allows the learning of effective target models with few unlabelled target samples.
NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few unlabelled target samples as key-value pairs.
NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
arXiv Detail & Related papers (2023-09-26T13:35:31Z) - Deep Metric Learning for Unsupervised Remote Sensing Change Detection [60.89777029184023]
Remote Sensing Change Detection (RS-CD) aims to detect relevant changes from Multi-Temporal Remote Sensing Images (MT-RSIs).
The performance of existing RS-CD methods is attributed to training on large annotated datasets.
This paper proposes an unsupervised CD method based on deep metric learning that can deal with both of these issues.
arXiv Detail & Related papers (2023-03-16T17:52:45Z) - Benchmarking Robustness of Deep Learning Classifiers Using Two-Factor Perturbation [4.016928101928335]
This paper adds to the fundamental body of work on benchmarking the robustness of deep learning (DL) classifiers.
We also introduce a new four-quadrant statistical visualization tool that reports minimum accuracy, maximum accuracy, mean accuracy, and coefficient of variation. All source code, related image sets, and preliminary data are shared on GitHub to support future academic research and industry projects.
arXiv Detail & Related papers (2021-03-02T02:10:54Z)