ReGround: Improving Textual and Spatial Grounding at No Cost
- URL: http://arxiv.org/abs/2403.13589v3
- Date: Fri, 19 Jul 2024 04:46:24 GMT
- Title: ReGround: Improving Textual and Spatial Grounding at No Cost
- Authors: Phillip Y. Lee, Minhyuk Sung
- Abstract summary: In a pretrained image diffusion model with gated self-attention, spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention.
We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture.
- Score: 12.944046673902415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing the flow between gated self-attention and cross-attention from sequential to parallel. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding.
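The sketch below illustrates the rewiring at the level of a single U-Net transformer block: in the original GLIGEN ordering, cross-attention runs downstream of gated self-attention, whereas in the rewired ("parallel") version both branches read the same input and their residual updates are summed. This is a minimal, hypothetical reconstruction rather than the released GLIGEN or ReGround code; the layer names, the scalar gate, and the assumption that grounding and text tokens are already projected to the visual feature dimension are all illustrative.

```python
# Minimal sketch (assumptions, not the released code) of sequential vs. parallel wiring
# of gated self-attention (spatial grounding) and cross-attention (textual grounding).
import torch
import torch.nn as nn

class GroundedTransformerBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8, parallel: bool = False):
        super().__init__()
        self.parallel = parallel  # False: GLIGEN-style sequential flow; True: ReGround-style rewiring
        self.norm_sa = nn.LayerNorm(dim)
        self.norm_ga = nn.LayerNorm(dim)
        self.norm_ca = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gated_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable scalar gate on the grounding branch
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, grounding_tokens, text_tokens):
        # x: (B, N, dim) visual tokens; grounding/text tokens assumed pre-projected to `dim`.
        h = self.norm_sa(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        # Gated self-attention over [visual tokens; grounding tokens]; keep the visual slice.
        g = self.norm_ga(torch.cat([x, grounding_tokens], dim=1))
        spatial = self.gate.tanh() * self.gated_self_attn(
            g, g, g, need_weights=False)[0][:, : x.shape[1]]

        if self.parallel:
            # Rewired: cross-attention reads the same features as the gated branch,
            # so the text prompt is no longer applied downstream of the box conditioning.
            q = self.norm_ca(x)
            textual = self.cross_attn(q, text_tokens, text_tokens, need_weights=False)[0]
            x = x + spatial + textual
        else:
            # Original: cross-attention only ever sees features already modulated by the boxes.
            x = x + spatial
            q = self.norm_ca(x)
            x = x + self.cross_attn(q, text_tokens, text_tokens, need_weights=False)[0]
        return x + self.ff(self.norm_ff(x))
```

Because only the wiring changes and no new parameters are introduced, the same pretrained weights can be reused in either mode, which is why the rewiring requires no fine-tuning.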
Related papers
- GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation [11.517082612850443]
We introduce GrounDiT, a training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT)
We leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box.
Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing.
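As a rough, hypothetical illustration of the per-box idea summarized above (not the GrounDiT implementation), one denoising step could crop the noisy latent at each bounding box, update that patch with only its own phrase, and transplant the result back; the `denoise_step` callable and the assumption that it accepts arbitrary patch sizes are placeholders.

```python
# Hypothetical sketch of noisy patch transplantation within one denoising step.
# `denoise_step(latent, prompt, t)` is an assumed callable returning an updated noisy latent.
import torch

def transplant_step(latent, boxes, phrases, t, denoise_step):
    """latent: (1, C, H, W) noisy latent; boxes: (x0, y0, x1, y1) in latent coordinates."""
    latent = denoise_step(latent, prompt=" and ".join(phrases), t=t)  # global update
    for (x0, y0, x1, y1), phrase in zip(boxes, phrases):
        patch = latent[:, :, y0:y1, x0:x1].clone()
        patch = denoise_step(patch, prompt=phrase, t=t)  # object-centric update on the cropped patch
        latent[:, :, y0:y1, x0:x1] = patch               # transplant the noisy patch back into its box
    return latent
```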
arXiv Detail & Related papers (2024-10-27T15:30:45Z) - Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
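A hypothetical sketch of what an inter-token constraint of this kind could look like: cross-attention maps of tokens assigned to different layout regions are penalized for overlapping, so competing concepts are pushed apart. The function and its inputs are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical inter-token conflict term over cross-attention maps.
import torch

def inter_token_conflict(attn_maps, token_groups):
    """attn_maps: (T, H, W) cross-attention maps for T prompt tokens;
    token_groups: one list of token indices per layout region."""
    loss = attn_maps.new_zeros(())
    for i, group_a in enumerate(token_groups):
        for group_b in token_groups[i + 1:]:
            a = attn_maps[group_a].mean(0)
            b = attn_maps[group_b].mean(0)
            loss = loss + (a * b).sum() / (a.sum() * b.sum() + 1e-8)  # penalize spatial overlap
    return loss
```

In a guidance loop, the gradient of such a term with respect to the noisy latent would steer generation away from token conflicts at each conditioning step.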
arXiv Detail & Related papers (2024-07-18T15:48:07Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
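A hypothetical sketch of the early-step interpolation described above: attention features cached from the source pass are blended into the target pass and faded out over the first fraction of denoising steps. The schedule and names are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical blending of source and target attention features during early denoising steps.
import torch

def blend_attention_features(src_feat, tgt_feat, step, num_steps, early_fraction=0.3):
    """Fade from source-dominated to target-only features over the early steps."""
    if step >= early_fraction * num_steps:
        return tgt_feat                              # late steps: target features only
    alpha = step / (early_fraction * num_steps)      # 0 -> source appearance, 1 -> target layout
    return (1.0 - alpha) * src_feat + alpha * tgt_feat
```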
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - SyntStereo2Real: Edge-Aware GAN for Remote Sensing Image-to-Image Translation while Maintaining Stereo Constraint [1.8749305679160366]
Current methods involve combining two networks, an unpaired image-to-image translation network and a stereo-matching network.
We propose an edge-aware GAN-based network that effectively tackles both tasks simultaneously.
We demonstrate that our model produces qualitatively and quantitatively better results than existing models, and that its applicability extends to diverse domains.
arXiv Detail & Related papers (2024-04-14T14:58:52Z) - R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
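The sketch below illustrates the general recipe behind this family of training-free approaches, stated as an assumption rather than the exact R&B formulation: an energy computed from cross-attention maps rewards attention mass inside each phrase's box, and its gradient nudges the noisy latent at every step. `denoise_step` and `get_attn_maps` are assumed hooks into the diffusion model.

```python
# Hypothetical region-aware cross-attention guidance for zero-shot grounded generation.
import torch

def region_energy(attn_maps, box_masks):
    """attn_maps: (K, H, W) cross-attention for K grounded tokens; box_masks: (K, H, W) in {0,1}."""
    inside = (attn_maps * box_masks).flatten(1).sum(-1)
    total = attn_maps.flatten(1).sum(-1) + 1e-8
    return (1.0 - inside / total).mean()              # low when attention stays inside the boxes

def guided_step(latent, t, denoise_step, get_attn_maps, box_masks, scale=20.0):
    latent = latent.detach().requires_grad_(True)
    attn = get_attn_maps(latent, t)                   # assumed hook returning (K, H, W) maps
    grad = torch.autograd.grad(region_energy(attn, box_masks), latent)[0]
    latent = (latent - scale * grad).detach()         # move the latent toward the layout
    return denoise_step(latent, t)                    # then take the usual denoising update
```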
arXiv Detail & Related papers (2023-10-13T05:48:42Z) - Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement [52.80968034977751]
Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions.
We propose the Semantic Prior Refinement Model (SPRM), whose predictions are obtained by combining the output of two main modules.
Our approach shows state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, with a 9.6% absolute improvement.
arXiv Detail & Related papers (2023-05-18T12:25:07Z) - Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment [130.84010267004803]
Training a generative adversarial network (GAN) with limited data has been a challenging task.
A feasible solution is to start with a GAN well-trained on a large-scale source domain and adapt it to the target domain with a few samples, termed few-shot generative model adaption.
We propose a relaxed spatial structural alignment method to calibrate the target generative models during the adaption.
arXiv Detail & Related papers (2022-03-06T14:26:25Z) - Distributed Attention for Grounded Image Captioning [55.752968732796354]
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z) - Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, we design a novel self-guided regression loss in addition to the frequently used VGG feature matching loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
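A hypothetical sketch of a densely combined dilated-convolution block of the kind described above: parallel branches with growing dilation are cascaded and fused with a residual connection, enlarging the receptive field while preserving local detail. Channel sizes and the fusion layout are illustrative assumptions, not the paper's exact block.

```python
# Hypothetical dense dilated-convolution block with residual fusion.
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 4, 3, padding=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(channels // 4, channels // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        h = self.reduce(x)
        outs, prev = [], 0
        for conv in self.branches:
            prev = conv(h + prev)      # each branch also sees the previous branch's output
            outs.append(prev)
        return x + self.fuse(torch.cat(outs, dim=1))   # residual fusion of all dilation scales
```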