Divide & Bind Your Attention for Improved Generative Semantic Nursing
- URL: http://arxiv.org/abs/2307.10864v3
- Date: Sun, 14 Jul 2024 16:20:19 GMT
- Title: Divide & Bind Your Attention for Improved Generative Semantic Nursing
- Authors: Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva
- Abstract summary: We propose Divide & Bind to address the challenges posed by complex prompts and scenarios involving multiple entities.
Our approach stands out in its ability to faithfully synthesize desired objects with improved attribute alignment from complex prompts.
- Score: 19.67265541441422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have exhibited impressive results with high fidelity. Despite this remarkable progress, current state-of-the-art models still struggle to generate images that fully adhere to the input prompt. Prior work, Attend & Excite, introduced the concept of Generative Semantic Nursing (GSN), which optimizes cross-attention during inference time to better incorporate the semantics. It demonstrates promising results on simple prompts, e.g., "a cat and a dog". However, its efficacy declines with more complex prompts, and it does not explicitly address the problem of improper attribute binding. To handle complex prompts and scenarios involving multiple entities, and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: an attendance loss and a binding loss. Our approach stands out in its ability to faithfully synthesize the desired objects with improved attribute alignment from complex prompts, and it exhibits superior performance across multiple evaluation benchmarks.
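As a rough sketch of how GSN-style inference-time optimization can work, the toy code below nudges the diffusion latent down the gradient of losses computed on cross-attention maps. The loss forms are simplified illustrations in the spirit of the attendance and binding objectives, not the paper's exact formulations, and `get_cross_attention` is a hypothetical hook into the denoiser's attention layers:

```python
import torch

def attendance_loss(attn_map: torch.Tensor) -> torch.Tensor:
    """Reward a strongly localized attention peak via total variation
    (a simplified stand-in for the paper's attendance objective)."""
    tv = (attn_map[1:, :] - attn_map[:-1, :]).abs().sum() \
       + (attn_map[:, 1:] - attn_map[:, :-1]).abs().sum()
    return -tv  # maximizing TV pushes the token to attend *somewhere*

def binding_loss(noun_map: torch.Tensor, attr_map: torch.Tensor) -> torch.Tensor:
    """Pull an attribute's attention distribution toward its noun's
    (a Jensen-Shannon-style divergence between the two maps)."""
    p = noun_map.flatten().softmax(dim=0)
    q = attr_map.flatten().softmax(dim=0)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gsn_step(latent, get_cross_attention, token_pairs, step_size=0.1):
    """One nursing update: move the latent down the combined loss gradient."""
    latent = latent.detach().requires_grad_(True)
    maps = get_cross_attention(latent)  # hypothetical: [num_tokens, H, W]
    loss = sum(attendance_loss(maps[n]) + binding_loss(maps[n], maps[a])
               for n, a in token_pairs)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()
```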
Related papers
- XPrompt: Explaining Large Language Model's Generation via Joint Prompt Attribution [26.639271355209104]
Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks.
However, the contribution of the input prompt to the generated content remains obscure to humans.
We introduce a counterfactual explanation framework based on joint prompt attribution, XPrompt.
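A hedged sketch of the counterfactual idea behind prompt attribution: drop one prompt token at a time and measure how much the model's likelihood of the original output falls. XPrompt attributes joint token sets; this single-token version with GPT-2 only illustrates the counterfactual principle:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def output_logprob(prompt: str, output: str) -> float:
    """Log-probability of `output` conditioned on `prompt`."""
    ids = tok(prompt + output, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.size(1)
    logp = model(ids).logits[0, :-1].log_softmax(-1)  # next-token predictions
    targets = ids[0, 1:]
    token_lp = logp[torch.arange(targets.size(0)), targets]
    return token_lp[n_prompt - 1:].sum().item()       # score output tokens only

def token_attributions(prompt: str, output: str):
    """Attribution of each word = drop in output log-likelihood when removed."""
    words = prompt.split()
    base = output_logprob(prompt, output)
    return {w: base - output_logprob(" ".join(words[:i] + words[i + 1:]), output)
            for i, w in enumerate(words)}

print(token_attributions("a photo of a red apple", " on a table"))
```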
arXiv Detail & Related papers (2024-05-30T18:16:41Z) - RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation [27.711809069547808]
RepSGG formulates subjects as queries, objects as keys, and their relationship as the maximum attention weight between pairwise queries and keys.
With more fine-grained and flexible representation power for entities and relationships, RepSGG learns to sample semantically discriminative and representative points for relationship inference.
RepSGG achieves state-of-the-art or comparable performance on the Visual Genome and Open Images V6 datasets with fast inference speed.
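A toy sketch of the query-key pairing described above, where the relationship strength between a subject and an object is read off as the maximum attention weight over their sampled points; the shapes and names are illustrative, not RepSGG's actual implementation:

```python
import torch

def relationship_scores(subject_feats: torch.Tensor,
                        object_feats: torch.Tensor) -> torch.Tensor:
    """subject_feats: [S, P, d] sampled query points per subject;
       object_feats:  [O, P, d] sampled key points per object.
       Returns a [S, O] matrix of relationship strengths."""
    S, P, d = subject_feats.shape
    O = object_feats.shape[0]
    q = subject_feats.reshape(S * P, d)
    k = object_feats.reshape(O * P, d)
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # [S*P, O*P]
    attn = attn.reshape(S, P, O, P)
    # relationship = strongest query-key interaction over any point pair
    return attn.amax(dim=(1, 3))                       # [S, O]

scores = relationship_scores(torch.randn(3, 8, 64), torch.randn(4, 8, 64))
print(scores.shape)  # torch.Size([3, 4])
```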
arXiv Detail & Related papers (2023-09-06T05:37:19Z) - Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment [87.1732801732059]
Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes.
We propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers.
Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen over current state-of-the-art methods.
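The first stage, syntactic analysis, can be illustrated with spaCy's dependency parse: adjectival modifiers and compounds are paired with their head nouns. This is only a sketch of the parsing step (it assumes the `en_core_web_sm` model is installed); the attention-map alignment stage is not shown:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_modifier_pairs(prompt: str):
    doc = nlp(prompt)
    pairs = []
    for token in doc:
        # adjectival modifiers and compounds attach to their head noun
        if token.dep_ in ("amod", "compound") and token.head.pos_ == "NOUN":
            pairs.append((token.head.text, token.text))
    return pairs

print(entity_modifier_pairs("a red crown and a golden strawberry"))
# e.g. [('crown', 'red'), ('strawberry', 'golden')]
```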
arXiv Detail & Related papers (2023-06-15T06:21:44Z) - Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose SummAttacker, an efficient approach to generating adversarial samples based on language models.
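A minimal sketch of word-level synonym substitution as a perturbation. SummAttacker itself is language-model-guided; here plain WordNet synonyms stand in for illustration (assumes `nltk` with the WordNet corpus downloaded):

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_substitute(sentence: str, rate: float = 0.2, seed: int = 0) -> str:
    """Replace a random fraction of words with a WordNet synonym."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        lemmas = {l.name().replace("_", " ")
                  for syn in wordnet.synsets(word) for l in syn.lemmas()}
        lemmas.discard(word)
        if lemmas and rng.random() < rate:
            out.append(rng.choice(sorted(lemmas)))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_substitute("the company reported strong quarterly earnings"))
```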
arXiv Detail & Related papers (2023-06-01T19:04:17Z) - Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
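A compact sketch of that intervention: during denoising, the latent is updated so that the most neglected subject token's peak cross-attention rises. This mirrors the max-attention loss in simplified form; `get_cross_attention` is again a hypothetical hook into the denoiser:

```python
import torch

def excite_loss(attn_maps: torch.Tensor, subject_token_ids) -> torch.Tensor:
    # focus on the most neglected subject: the one with the lowest peak
    peaks = torch.stack([attn_maps[i].max() for i in subject_token_ids])
    return (1.0 - peaks).max()

def attend_and_excite_step(latent, get_cross_attention, subject_token_ids,
                           step_size=0.5):
    """One guidance step: raise the weakest subject's peak attention."""
    latent = latent.detach().requires_grad_(True)
    loss = excite_loss(get_cross_attention(latent), subject_token_ids)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()
```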
arXiv Detail & Related papers (2023-01-31T18:10:38Z) - UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the vision Transformer suitable for consistency representation learning.
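A toy reading of that observation: treat the patch-to-patch self-attention matrix as a consistency map and compare intra-region with cross-region attention. This is purely illustrative; the region mask and shapes are assumptions, not UIA-ViT's actual method:

```python
import torch

def patch_consistency(attn: torch.Tensor, inner_mask: torch.Tensor):
    """attn: [N, N] self-attention among N patch embeddings;
       inner_mask: [N] bool, True for patches inside an (assumed) face box.
       Returns mean intra-region vs. cross-region attention."""
    intra = attn[inner_mask][:, inner_mask].mean()
    cross = attn[inner_mask][:, ~inner_mask].mean()
    return intra, cross  # a large gap hints at inconsistent regions

attn = torch.softmax(torch.randn(196, 196), dim=-1)  # 14x14 patch grid
mask = torch.zeros(196, dtype=torch.bool)
mask[60:130] = True
print(patch_consistency(attn, mask))
```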
arXiv Detail & Related papers (2022-10-23T15:24:47Z) - Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive prediction decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
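A minimal sketch of the pluggable-visual-prefix idea: project image features into the text embedding space and prepend them, so attention over the sequence is visually guided. HVPNeT's prefix is hierarchical and injected per layer; this flat version, with assumed shapes, only shows the core mechanism:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, prefix_len=4):
        super().__init__()
        self.proj = nn.Linear(vis_dim, prefix_len * txt_dim)
        self.prefix_len, self.txt_dim = prefix_len, txt_dim

    def forward(self, vis_feat, txt_embeds):
        """vis_feat: [B, vis_dim]; txt_embeds: [B, T, txt_dim]"""
        prefix = self.proj(vis_feat).view(-1, self.prefix_len, self.txt_dim)
        return torch.cat([prefix, txt_embeds], dim=1)  # [B, prefix_len+T, D]

fuse = VisualPrefix()
out = fuse(torch.randn(2, 2048), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 20, 768])
```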
arXiv Detail & Related papers (2022-05-07T02:10:55Z) - A Context-Aware Feature Fusion Framework for Punctuation Restoration [28.38472792385083]
We propose a novel Feature Fusion framework based on two types of attention (FFA) to alleviate the shortage of attention.
Experiments on the popular benchmark dataset IWSLT demonstrate that our approach is effective.
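A rough stand-in for fusing two attention views of the same sequence, in the spirit of the two-attention fusion described above. The exact FFA design differs; the global/local split and the gating layer here are assumptions:

```python
import torch
import torch.nn as nn

class TwoAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, local_window=5):
        g, _ = self.global_attn(x, x, x)                 # full-context view
        T = x.size(1)
        idx = torch.arange(T)
        mask = (idx[None, :] - idx[:, None]).abs() > local_window
        l, _ = self.local_attn(x, x, x, attn_mask=mask)  # neighborhood view
        return self.gate(torch.cat([g, l], dim=-1))      # learned fusion

m = TwoAttentionFusion()
print(m(torch.randn(2, 30, 256)).shape)  # torch.Size([2, 30, 256])
```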
arXiv Detail & Related papers (2022-03-23T15:29:28Z) - Adversarial Semantic Data Augmentation for Human Pose Estimation [96.75411357541438]
We propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts at various semantic granularities.
We also propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations.
State-of-the-art results are achieved on challenging benchmarks.
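A small sketch of the pasting operation behind semantic augmentation: alpha-composite a segmented body part onto an image at a given location. ASDA's adversarial configuration predictor is not modeled here; the shapes and the compositing are assumptions:

```python
import numpy as np

def paste_part(image: np.ndarray, part: np.ndarray, mask: np.ndarray,
               x: int, y: int) -> np.ndarray:
    """image: [H, W, 3]; part: [h, w, 3]; mask: [h, w] segmentation of the part.
       Alpha-composites the part with top-left corner at (x, y)."""
    out = image.copy()
    h, w = mask.shape
    region = out[y:y + h, x:x + w]
    m = mask[..., None].astype(out.dtype)
    out[y:y + h, x:x + w] = part * m + region * (1 - m)
    return out

img = np.zeros((256, 256, 3), dtype=np.float32)
arm = np.ones((40, 20, 3), dtype=np.float32)
msk = np.ones((40, 20), dtype=np.float32)
print(paste_part(img, arm, msk, x=100, y=80).sum())  # 2400.0
```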
arXiv Detail & Related papers (2020-08-03T07:56:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.