Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
- URL: http://arxiv.org/abs/2512.07702v2
- Date: Thu, 11 Dec 2025 06:42:25 GMT
- Title: Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
- Authors: Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon
- Abstract summary: Negative Prompting for Image Correction (NPC) identifies and applies negative prompts that suppress unintended content. NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives-those directly tied to the prompt's alignment error-and untargeted negatives-tokens unrelated to the prompt but present in the generated image-can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
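The abstract describes suppressing unintended content via negative prompts. As background for how a negative prompt enters a diffusion sampler, here is a minimal toy sketch (not the NPC implementation, and not tied to any specific library): in classifier-free guidance, the embedding of the negative prompt replaces the unconditional embedding, so each denoising step is steered away from the negative prompt's prediction. The arrays below are illustrative placeholders.

```python
import numpy as np

def guided_noise(eps_cond, eps_neg, scale=7.5):
    """Classifier-free guidance with a negative prompt.

    eps_cond: noise prediction conditioned on the positive prompt.
    eps_neg:  noise prediction conditioned on the negative prompt
              (with no negative prompt, this is the unconditional one).
    The result moves along (eps_cond - eps_neg), i.e. toward the
    positive prompt and away from the negative one.
    """
    return eps_neg + scale * (eps_cond - eps_neg)

# Toy 2-D "noise predictions" standing in for U-Net outputs.
eps_cond = np.array([1.0, 0.0])
eps_neg = np.array([0.0, 1.0])
print(guided_noise(eps_cond, eps_neg, scale=2.0))  # [ 2. -1.]
```

In real pipelines this update is applied at every denoising step; NPC's contribution is in *choosing* the negative prompt automatically, not in changing this guidance rule.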
Related papers
- AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs [27.133240420463807]
AlignBench is a benchmark that provides a new indicator of image-text alignment. It evaluates detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators.
arXiv Detail & Related papers (2025-11-25T17:19:47Z) - Conceptrol: Concept Control of Zero-shot Personalized Image Generation [36.39574513193442]
Conceptrol is a framework that enhances zero-shot adapters without adding computational overhead. It achieves as much as an 89% improvement on personalization benchmarks over the vanilla IP-Adapter.
arXiv Detail & Related papers (2025-03-09T11:54:08Z) - Negative Token Merging: Image-based Adversarial Feature Guidance [114.65069052244088]
We introduce negative token merging (NegToMe) to perform adversarial guidance through images. NegToMe selectively pushes apart matching visual features between reference and generated images during the reverse diffusion process. It significantly enhances output diversity and reduces visual similarity to copyrighted content by 34.57%.
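The "pushing apart matching visual features" idea can be illustrated with a small toy sketch. This is not the NegToMe implementation; it only mimics the core step under simplifying assumptions: match each generated-token feature to its most cosine-similar reference feature, and nudge it away from that match when the similarity exceeds a threshold. All names and values here are illustrative.

```python
import numpy as np

def push_apart(gen_feats, ref_feats, alpha=0.5, thresh=0.9):
    """Push generated features away from their closest reference features.

    gen_feats: (n_gen, d) token features from the generated image.
    ref_feats: (n_ref, d) token features from the reference image.
    Features whose best cosine match in ref_feats exceeds `thresh`
    are moved opposite to that matched reference feature.
    """
    gen_n = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref_n = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = gen_n @ ref_n.T                 # (n_gen, n_ref) cosine similarities
    best = sim.argmax(axis=1)             # index of closest reference token
    mask = sim.max(axis=1) > thresh       # only push near-duplicates apart
    out = gen_feats.copy()
    out[mask] -= alpha * ref_feats[best[mask]]
    return out

gen = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[1.0, 0.0]])              # matches only the first token
print(push_apart(gen, ref))               # [[0.5 0. ] [0.  1. ]]
```

In the paper this kind of update is applied to transformer features during reverse diffusion; the sketch only shows the matching-then-repelling arithmetic on static arrays.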
arXiv Detail & Related papers (2024-12-02T10:06:57Z) - Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z) - FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple yet effective approach based on adaptively adjusting per-token prompt weights. We show FRAP generates images with significantly higher prompt-image alignment for prompts from complex datasets. We also explore combining FRAP with a prompt-rewriting LLM to recover degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z) - DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve.
It relies on word alignments between a description of the original source image and the instruction describing the needed updates, together with the input image.
It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z) - Evaluating Text-to-Visual Generation with Image-to-Text Generation [113.07368313330994]
VQAScore uses a visual-question-answering (VQA) model to produce an alignment score.
It produces state-of-the-art results across many (8) image-text alignment benchmarks.
We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts.
arXiv Detail & Related papers (2024-04-01T17:58:06Z) - Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation [1.4138057640459576]
We propose NegOpt, a novel method for optimizing negative prompt generation toward enhanced image generation.
Our combined approach results in a substantial increase of 25% in Inception Score compared to other approaches.
arXiv Detail & Related papers (2024-03-12T12:44:34Z) - Universal Prompt Optimizer for Safe Text-to-Image Generation [27.32589928097192]
We propose POSI, the first universal prompt optimizer for safe T2I generation in a black-box scenario. Our approach effectively reduces the likelihood that various T2I models generate inappropriate images.
arXiv Detail & Related papers (2024-02-16T18:36:36Z) - Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.