InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2512.17851v1
- Date: Fri, 19 Dec 2025 17:52:43 GMT
- Title: InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
- Authors: Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad
- Abstract summary: InfSplign is a training-free inference-time method for text-to-image models. It improves spatial alignment by adjusting the noise through a compound loss in every denoising step. It achieves substantial performance gains over the strongest existing inference-time baselines.
- Score: 27.206678799411645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
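To make the mechanism concrete, the per-step adjustment the abstract describes can be sketched roughly as below. Everything here (the `unet_with_attn` callable, the exact loss terms, the step size) is an illustrative assumption about how such attention-guided latent steering is typically wired up, not the paper's actual implementation.

```python
import torch

def spatial_compound_loss(attn_maps, target_masks):
    """attn_maps: dict token_idx -> (H, W) cross-attention map for an object token.
    target_masks: dict token_idx -> (H, W) binary mask marking where the prompt's
    spatial relation says that object should sit. Both inputs are assumptions."""
    loss, masses = 0.0, []
    for tok, amap in attn_maps.items():
        total = amap.sum()
        inside = (amap * target_masks[tok]).sum() / (total + 1e-8)
        loss = loss + (1.0 - inside)   # placement term: pull attention into the target region
        masses.append(total)
    # balance term: keep per-object attention mass comparable so no object vanishes
    masses = torch.stack(masses)
    loss = loss + ((masses / (masses.mean() + 1e-8)) - 1.0).pow(2).mean()
    return loss

def adjust_noise(latent, unet_with_attn, t, text_emb, target_masks, step_size=0.1):
    """One guidance step on the noisy latent, repeated at every denoising step."""
    latent = latent.detach().requires_grad_(True)
    # unet_with_attn is assumed to return (noise_pred, attn_maps); in practice the
    # maps are collected with forward hooks on the decoder's cross-attention layers.
    _, attn_maps = unet_with_attn(latent, t, text_emb)
    loss = spatial_compound_loss(attn_maps, target_masks)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()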
Related papers
- SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation [9.212970624261272]
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization.
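The "signal component" during denoising is commonly taken to be the clean image x0 that the model implicitly predicts at each step. For reference, this is how it is recovered from the noise prediction via standard DDPM algebra (background material, not SAGA's actual training procedure):

```python
import torch

def predicted_x0(x_t, eps_pred, alpha_bar_t):
    """Recover the implied clean image ("signal") from the noise prediction,
    using x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    alpha_bar_t is the cumulative noise-schedule term, given as a tensor."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```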
arXiv Detail & Related papers (2025-08-19T14:31:15Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
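A minimal sketch of what frequency decoupling can look like: splitting a feature map into low-frequency structure and high-frequency edges with an ideal low-pass mask in the Fourier domain. The cutoff and the masking scheme are illustrative assumptions, not FreDF's actual design.

```python
import torch

def frequency_decouple(feat, cutoff=0.25):
    """Split features (B, C, H, W) into (low-frequency structure, high-frequency detail)."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    # radial low-pass mask on the centered spectrum
    yy = torch.linspace(-1.0, 1.0, H, device=feat.device).view(H, 1)
    xx = torch.linspace(-1.0, 1.0, W, device=feat.device).view(1, W)
    lp_mask = ((yy**2 + xx**2).sqrt() <= cutoff).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * lp_mask, dim=(-2, -1))).real
    return low, feat - low
```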
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models [18.89863162308386]
CoMPaSS is a versatile framework that enhances spatial understanding in T2I models. It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine. To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module.
arXiv Detail & Related papers (2024-12-17T18:59:50Z)
- Training-Free Layout-to-Image Generation with Marginal Attention Constraints [73.55660250459132]
We propose a training-free layout-to-image (L2I) approach, which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions. We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
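The self-attention step can be pictured as propagating token attention through pixel-to-pixel affinities, so that pixels belonging to one object agree on which text token they attend to. A sketch under assumed map shapes, not the paper's code:

```python
import torch

def refine_cross_attention(cross_attn, self_attn):
    """cross_attn: (N, T) attention from N pixels to T text tokens.
    self_attn: (N, N) row-stochastic pixel-to-pixel affinity.
    Each pixel mixes in the token attention of its correlated pixels."""
    return self_attn @ cross_attn
```

The refined maps can then be compared against the provided layout to form the losses that update the latent features.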
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
- Improving Consistency in Diffusion Models for Image Super-Resolution [28.945663118445037]
We observe two kinds of inconsistencies in diffusion-based methods. We introduce ConsisSR to handle both semantic and training-inference consistency. Our method demonstrates state-of-the-art performance among existing diffusion models.
arXiv Detail & Related papers (2024-10-17T17:41:52Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
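The coarse stage's caption reward boils down to captioning the generated image and scoring how well the caption agrees with the prompt. A hedged sketch with a stand-in sentence encoder; the model choice and scoring are illustrative, not necessarily RealignDiff's:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_reward(prompt: str, caption: str) -> float:
    """Higher reward = the caption of the generated image matches the prompt better."""
    emb = encoder.encode([prompt, caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```

The caption itself would come from any off-the-shelf image captioning model applied to the generated image.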
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
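A generic way to impose explicit control over cross-attention is to reweight the maps for chosen subject tokens and renormalize. The sketch below shows that idea with the timestep/layer schedule left out; the specifics are an assumption, not this paper's algorithm:

```python
import torch

def reweight_cross_attention(attn, token_ids, scale=2.0):
    """attn: (heads, N_pixels, T_tokens) softmaxed cross-attention.
    Amplify the chosen text tokens, then renormalize so each pixel's
    weights remain a probability distribution."""
    attn = attn.clone()
    attn[..., token_ids] = attn[..., token_ids] * scale
    return attn / attn.sum(dim=-1, keepdim=True)
```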
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching using a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
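Cycle consistency here means that transformations composed around a loop of images should return to the identity. A minimal sketch with homogeneous 3x3 transforms and a three-image cycle (one assumed instantiation, not the paper's exact formulation):

```python
import torch

def cycle_consistency_loss(T_ab, T_bc, T_ca):
    """Each T_xy: (B, 3, 3) homogeneous transform predicted from image x to image y.
    Going A -> B -> C -> A should compose to the identity; penalize the deviation."""
    cycle = T_ca @ T_bc @ T_ab
    eye = torch.eye(3, device=T_ab.device).expand_as(cycle)
    return (cycle - eye).pow(2).mean()
```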
arXiv Detail & Related papers (2020-03-31T22:38:09Z)