InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2512.17851v1
- Date: Fri, 19 Dec 2025 17:52:43 GMT
- Title: InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
- Authors: Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad
- Abstract summary: InfSplign is a training-free inference-time method for text-to-image models. It improves spatial alignment by adjusting the noise through a compound loss in every denoising step. It achieves substantial performance gains over the strongest existing inference-time baselines.
- Score: 27.206678799411645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
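To make the mechanism concrete, the per-step adjustment the abstract describes can be sketched roughly as below. Everything here (the `unet_with_attn` callable, the exact loss terms, the step size) is an illustrative assumption about how such attention-guided latent steering is typically wired up, not the paper's actual implementation.

```python
import torch

def spatial_compound_loss(attn_maps, target_masks):
    """attn_maps: dict token_idx -> (H, W) cross-attention map for an object token.
    target_masks: dict token_idx -> (H, W) binary mask marking where the prompt's
    spatial relation says that object should sit. Both inputs are assumptions."""
    loss, masses = 0.0, []
    for tok, amap in attn_maps.items():
        total = amap.sum()
        inside = (amap * target_masks[tok]).sum() / (total + 1e-8)
        loss = loss + (1.0 - inside)   # placement term: pull attention into the target region
        masses.append(total)
    # balance term: keep per-object attention mass comparable so no object vanishes
    masses = torch.stack(masses)
    loss = loss + ((masses / (masses.mean() + 1e-8)) - 1.0).pow(2).mean()
    return loss

def adjust_noise(latent, unet_with_attn, t, text_emb, target_masks, step_size=0.1):
    """One guidance step on the noisy latent, repeated at every denoising step."""
    latent = latent.detach().requires_grad_(True)
    # unet_with_attn is assumed to return (noise_pred, attn_maps); in practice the
    # maps are collected with forward hooks on the decoder's cross-attention layers.
    _, attn_maps = unet_with_attn(latent, t, text_emb)
    loss = spatial_compound_loss(attn_maps, target_masks)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()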
Related papers
- SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation [9.212970624261272]
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization.
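The "signal component" during denoising is commonly taken to be the clean image x0 that the model implicitly predicts at each step. For reference, this is how it is recovered from the noise prediction via standard DDPM algebra (background material, not SAGA's actual training procedure):

```python
import torch

def predicted_x0(x_t, eps_pred, alpha_bar_t):
    """Recover the implied clean image ("signal") from the noise prediction,
    using x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    alpha_bar_t is the cumulative noise-schedule term, given as a tensor."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```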
arXiv Detail & Related papers (2025-08-19T14:31:15Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
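A minimal sketch of what frequency decoupling can look like: splitting a feature map into low-frequency structure and high-frequency edges with an ideal low-pass mask in the Fourier domain. The cutoff and the masking scheme are illustrative assumptions, not FreDF's actual design.

```python
import torch

def frequency_decouple(feat, cutoff=0.25):
    """Split features (B, C, H, W) into (low-frequency structure, high-frequency detail)."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    # radial low-pass mask on the centered spectrum
    yy = torch.linspace(-1.0, 1.0, H, device=feat.device).view(H, 1)
    xx = torch.linspace(-1.0, 1.0, W, device=feat.device).view(1, W)
    lp_mask = ((yy**2 + xx**2).sqrt() <= cutoff).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * lp_mask, dim=(-2, -1))).real
    return low, feat - low
```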
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models [18.89863162308386]
CoMPaSS is a versatile framework that enhances spatial understanding in T2I models. It first addresses data ambiguity with the Spatial Constraints-Oriented Pairing (SCOP) data engine. To leverage these priors, CoMPaSS also introduces the Token ENcoding ORdering (TENOR) module.
arXiv Detail & Related papers (2024-12-17T18:59:50Z)
- Training-Free Layout-to-Image Generation with Marginal Attention Constraints [73.55660250459132]
We propose a training-free layout-to-image (L2I) approach, which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions. We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
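The self-attention step can be pictured as propagating token attention through pixel-to-pixel affinities, so that pixels belonging to one object agree on which text token they attend to. A sketch under assumed map shapes, not the paper's code:

```python
import torch

def refine_cross_attention(cross_attn, self_attn):
    """cross_attn: (N, T) attention from N pixels to T text tokens.
    self_attn: (N, N) row-stochastic pixel-to-pixel affinity.
    Each pixel mixes in the token attention of its correlated pixels."""
    return self_attn @ cross_attn
```

The refined maps can then be compared against the provided layout to form the losses that update the latent features.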
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
- Improving Consistency in Diffusion Models for Image Super-Resolution [28.945663118445037]
We observe two kinds of inconsistencies in diffusion-based methods. We introduce ConsisSR to handle both semantic and training-inference consistency. Our method demonstrates state-of-the-art performance among existing diffusion models.
arXiv Detail & Related papers (2024-10-17T17:41:52Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
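The coarse stage's caption reward boils down to captioning the generated image and scoring how well the caption agrees with the prompt. A hedged sketch with a stand-in sentence encoder; the model choice and scoring are illustrative, not necessarily RealignDiff's:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_reward(prompt: str, caption: str) -> float:
    """Higher reward = the caption of the generated image matches the prompt better."""
    emb = encoder.encode([prompt, caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```

The caption itself would come from any off-the-shelf image captioning model applied to the generated image.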
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
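A generic way to impose explicit control over cross-attention is to reweight the maps for chosen subject tokens and renormalize. The sketch below shows that idea with the timestep/layer schedule left out; the specifics are an assumption, not this paper's algorithm:

```python
import torch

def reweight_cross_attention(attn, token_ids, scale=2.0):
    """attn: (heads, N_pixels, T_tokens) softmaxed cross-attention.
    Amplify the chosen text tokens, then renormalize so each pixel's
    weights remain a probability distribution."""
    attn = attn.clone()
    attn[..., token_ids] = attn[..., token_ids] * scale
    return attn / attn.sum(dim=-1, keepdim=True)
```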
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching using a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
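Cycle consistency here means that transformations composed around a loop of images should return to the identity. A minimal sketch with homogeneous 3x3 transforms and a three-image cycle (one assumed instantiation, not the paper's exact formulation):

```python
import torch

def cycle_consistency_loss(T_ab, T_bc, T_ca):
    """Each T_xy: (B, 3, 3) homogeneous transform predicted from image x to image y.
    Going A -> B -> C -> A should compose to the identity; penalize the deviation."""
    cycle = T_ca @ T_bc @ T_ab
    eye = torch.eye(3, device=T_ab.device).expand_as(cycle)
    return (cycle - eye).pow(2).mean()
```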
arXiv Detail & Related papers (2020-03-31T22:38:09Z)