Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2411.07132v1
- Date: Mon, 11 Nov 2024 17:05:15 GMT
- Title: Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
- Authors: Taihang Hu, Linxuan Li, Joost van de Weijer, Hongcheng Gao, Fahad Shahbaz Khan, Jian Yang, Ming-Ming Cheng, Kai Wang, Yaxing Wang
- Abstract summary: Text-to-image (T2I) models often fail to accurately bind semantically related objects or attributes in the input prompts.
We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token.
- Score: 98.21700880115938
- License:
- Abstract: Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes, and its sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects in complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, that iteratively update the composite token to improve generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at \url{https://github.com/hutaihang/ToMe}.
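To make the mechanism concrete, below is a minimal sketch of the merging step and one auxiliary objective, assuming PyTorch and a CLIP-style text encoder; the function names, the averaging rule, and the padding treatment are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of the token-merging idea (illustrative only, not the ToMe release).
# `prompt_embeds` is the per-token text-encoder output of shape (seq_len, dim);
# the indices of an object token and its related attribute / sub-object tokens
# are assumed to be known, e.g. from a syntactic parse of the prompt.
import torch

def build_composite_token(prompt_embeds: torch.Tensor,
                          object_idx: int,
                          related_idxs: list[int]) -> torch.Tensor:
    """Aggregate the object token and its related tokens into one composite
    token so that they share a single cross-attention map downstream."""
    group = torch.tensor([object_idx] + related_idxs)
    # A plain average is used here for illustration; the paper's aggregation
    # rule may differ.
    composite = prompt_embeds[group].mean(dim=0)

    merged = prompt_embeds.clone()
    merged[object_idx] = composite
    # Neutralise the absorbed positions (here: copy the trailing EOT/pad
    # embedding) so the sequence length expected by the diffusion model
    # (e.g. 77 for CLIP) is preserved.
    merged[related_idxs] = prompt_embeds[-1]
    return merged

def entropy_loss(cross_attn: torch.Tensor, composite_idx: int) -> torch.Tensor:
    """One possible auxiliary objective: penalise a diffuse cross-attention
    map for the composite token. `cross_attn` has shape (num_pixels, num_tokens)."""
    p = cross_attn[:, composite_idx]
    p = p / p.sum().clamp_min(1e-8)
    return -(p * (p + 1e-8).log()).sum()
```

In a full pipeline, the merged embedding would replace the original prompt embedding fed to the U-Net's cross-attention layers, and losses of this kind would be used to iteratively update the composite token during the early denoising steps, where the layout is determined.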
Related papers
- STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM [59.08493154172207]
We propose a unified framework to streamline the semantic tokenization and generative recommendation process.
We formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task.
All these tasks are framed in a generative manner and trained using a single large language model (LLM) backbone.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer [51.260384040953326]
Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition.
PosFormer consistently outperforms state-of-the-art methods, with gains of 2.03%/1.22%/2, 1.83%, and 4.62% across the evaluated datasets.
arXiv Detail & Related papers (2024-07-10T15:42:58Z) - Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation, and especially in attribute-object binding, on several datasets.
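As a rough illustration of the focused cross-attention idea summarized above (a hypothetical helper, not the paper's implementation), an attribute token's attention can be masked to the region attended by its syntactic head and then renormalized:

```python
# Illustrative sketch only: restrict an attribute token's cross-attention to the
# region attended by its syntactic head (the object), then renormalise per pixel.
import torch

def focus_attribute_attention(attn: torch.Tensor,
                              attr_idx: int,
                              obj_idx: int,
                              rel_threshold: float = 0.5) -> torch.Tensor:
    """`attn` holds cross-attention probabilities of shape (num_pixels, num_tokens)."""
    obj_map = attn[:, obj_idx]
    mask = (obj_map >= rel_threshold * obj_map.max()).float()

    focused = attn.clone()
    focused[:, attr_idx] *= mask   # suppress attribute attention outside the object's region
    return focused / focused.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```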
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models [28.278822620442774]
Box-it-to-Bind-it (B2B) is a training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance.
B2B is designed as a compatible plug-and-play module for existing T2I models.
arXiv Detail & Related papers (2024-02-27T21:51:32Z) - DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization [31.960807999301196]
We propose a novel plug-in method, called DreamMatcher, which reformulates T2I personalization as semantic matching.
Specifically, DreamMatcher replaces the target values with reference values aligned by semantic matching, while leaving the structure path unchanged.
We also introduce a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions introduced by the target prompts.
arXiv Detail & Related papers (2024-02-15T09:21:16Z) - Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER [43.425998295991135]
We propose a bidirectional generative alignment method named BGA-MNER to tackle these issues.
Our method achieves state-of-the-art performance without image input during inference.
arXiv Detail & Related papers (2023-08-03T10:37:20Z) - Context-LGM: Leveraging Object-Context Relation for Context-Aware Object Recognition [48.5398871460388]
We propose a novel Contextual Latent Generative Model (Context-LGM), which considers the object-context relation and models it in a hierarchical manner.
To infer contextual features, we reformulate the objective function of the Variational Auto-Encoder (VAE) so that contextual features are learned as a posterior distribution conditioned on the object.
The effectiveness of our method is verified by state-of-the-art performance on two context-aware object recognition tasks.
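The reformulated objective alluded to above can be pictured as a conditional ELBO of roughly the following shape (a generic sketch with illustrative notation, not the paper's exact formulation), where the latent context code z is inferred from a posterior conditioned on the object feature o:

```latex
% Generic conditional-VAE ELBO sketch; x is the scene observation, o the object
% feature, and z the latent contextual feature (notation is illustrative).
\mathcal{L}(x, o) =
  \mathbb{E}_{q_\phi(z \mid x, o)}\!\left[\log p_\theta(x \mid z, o)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x, o)\,\|\,p_\theta(z \mid o)\right)
```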
arXiv Detail & Related papers (2021-10-08T11:31:58Z)