Attribute-Centric Compositional Text-to-Image Generation
- URL: http://arxiv.org/abs/2301.01413v1
- Date: Wed, 4 Jan 2023 03:03:08 GMT
- Title: Attribute-Centric Compositional Text-to-Image Generation
- Authors: Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang
- Abstract summary: ACTIG is an attribute-centric compositional text-to-image generation framework.
We present an attribute-centric feature augmentation and a novel image-free training scheme.
We validate our framework on the CelebA-HQ and CUB datasets.
- Score: 45.12516226662346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent impressive breakthroughs in text-to-image generation,
generative models have difficulty in capturing the data distribution of
underrepresented attribute compositions while over-memorizing overrepresented
attribute compositions, which raises public concerns about their robustness and
fairness. To tackle this challenge, we propose ACTIG, an attribute-centric
compositional text-to-image generation framework. We present an
attribute-centric feature augmentation and a novel image-free training scheme,
which greatly improve the model's ability to generate images with underrepresented
attributes. We further propose an attribute-centric contrastive loss to avoid
overfitting to overrepresented attribute compositions. We validate our
framework on the CelebA-HQ and CUB datasets. Extensive experiments show that
the compositional generalization of ACTIG is outstanding, and our framework
outperforms previous works in terms of image quality and text-image
consistency.
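The abstract names an attribute-centric contrastive loss but gives no code. Below is a minimal sketch of what such a loss could look like, assuming an InfoNCE-style objective in which each image embedding is pulled toward its own attribute-composition embedding and pushed away from the other compositions in the batch; the function name and temperature are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def attr_contrastive_loss(img_feats: torch.Tensor,
                          attr_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: align each image with its own attribute-composition
    embedding, repel the other compositions in the batch (an illustrative
    reading, not the paper's exact formulation)."""
    img_feats = F.normalize(img_feats, dim=-1)
    attr_feats = F.normalize(attr_feats, dim=-1)
    logits = img_feats @ attr_feats.t() / temperature   # (B, B) similarities
    targets = torch.arange(img_feats.size(0), device=img_feats.device)
    # symmetric: image->attributes and attributes->image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with random features
loss = attr_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

The symmetric form penalizes overfitting in both directions, which matches the stated goal of not memorizing overrepresented compositions.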
Related papers
- Z-Magic: Zero-shot Multiple Attributes Guided Image Creator [24.88532732093652]
We reformulate multi-attribute creation from a conditional probability theory perspective and tackle the challenging zero-shot setting.
By explicitly modeling the dependencies between attributes, we further enhance the coherence of generated images.
We identify connections between multi-attribute customization and multi-task learning, effectively addressing the high computing cost encountered in multi-attribute synthesis.
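The summary frames multi-attribute creation in conditional-probability terms. One standard way to realize such a factorization in diffusion models is composable classifier-free guidance, where per-attribute guidance deltas are summed; this is a generic technique we use for illustration, not necessarily Z-Magic's dependency-aware formulation.

```python
import torch

def composed_guidance(eps_uncond: torch.Tensor,
                      eps_conds: list[torch.Tensor],
                      weights: list[float]) -> torch.Tensor:
    """Compose several attribute conditions by summing each condition's
    guidance delta against the unconditional noise prediction (generic
    composable classifier-free guidance, shown here as a sketch)."""
    out = eps_uncond.clone()
    for eps_c, w in zip(eps_conds, weights):
        out = out + w * (eps_c - eps_uncond)
    return out

# toy usage: two attribute conditions on one noise prediction
eps_u = torch.randn(1, 4, 64, 64)
eps = composed_guidance(eps_u,
                        [torch.randn_like(eps_u), torch.randn_like(eps_u)],
                        weights=[3.0, 3.0])
```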
arXiv Detail & Related papers (2025-03-15T13:07:58Z)
- TAGE: Trustworthy Attribute Group Editing for Stable Few-shot Image Generation [10.569380190029317]
TAGE is an innovative image generation network comprising three integral modules.
The CPM module delves into the semantic dimensions of category-agnostic attributes, encapsulating them within a discrete codebook.
The PSM module generates semantic cues that are seamlessly integrated into the Transformer architecture of the CPM.
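Encapsulating attributes in a discrete codebook, as the CPM module is said to do, is commonly implemented with vector quantization. The following is a generic VQ layer sketch, not TAGE's actual module; all names are ours.

```python
import torch
import torch.nn as nn

class AttributeCodebook(nn.Module):
    """Nearest-neighbor vector quantization of attribute features into a
    discrete codebook (a generic VQ layer, not TAGE's exact CPM design)."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (B, dim) continuous attribute features
        dists = torch.cdist(z, self.codes.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)                  # nearest code per sample
        z_q = self.codes(idx)
        # straight-through estimator so gradients reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, idx

codebook = AttributeCodebook()
z_q, idx = codebook(torch.randn(8, 256))
```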
arXiv Detail & Related papers (2024-10-23T13:26:19Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic knowledge bases (KBs) to generate semantically consistent yet distinctive image-text pairs.
This highlights the value of leveraging external knowledge proxies for enhanced interpretability and real-world grounding. A toy sketch of the caption-side manipulation appears below.
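The caption side of this augmentation can be sketched as swapping an attribute value for a different KB-grounded value; the KB fragment and function names here are made up for illustration, and the matching image edit is not shown.

```python
import random

# Hypothetical knowledge-base fragment: entity type -> known attribute values.
ATTRIBUTE_KB = {
    "bird": {"wing_color": ["red", "blue", "yellow"],
             "beak_shape": ["hooked", "needle", "cone"]},
}

def augment_caption(caption: str, entity: str, attribute: str,
                    current_value: str) -> tuple[str, str]:
    """Replace one attribute value in a caption with a different,
    KB-grounded value; returns the new caption and the new value.
    The paired image would then be edited to match (not shown)."""
    candidates = [v for v in ATTRIBUTE_KB[entity][attribute]
                  if v != current_value]
    new_value = random.choice(candidates)
    return caption.replace(current_value, new_value), new_value

new_cap, new_val = augment_caption(
    "a bird with red wings perched on a branch", "bird", "wing_color", "red")
print(new_cap)  # e.g. "a bird with blue wings perched on a branch"
```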
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models [46.723653095494896]
We show that imperfect text conditioning with the CLIP text encoder is one of the primary reasons behind the inability of text-to-image models to generate high-fidelity compositional scenes.
Our main finding shows that the best compositional improvements can be achieved without harming the model's FID scores.
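One quick way to observe the imperfect-conditioning issue is to compare CLIP text embeddings of a prompt and its attribute-swapped counterpart: if they are nearly identical, the generator receives little signal to bind attributes correctly. A sketch using the Hugging Face transformers CLIP (the model checkpoint and prompts are our choices, not the paper's):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a red cube on a blue sphere",   # original composition
           "a blue cube on a red sphere"]   # attributes swapped

with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    feats = text_encoder(**tokens).pooler_output       # (2, hidden)
    feats = feats / feats.norm(dim=-1, keepdim=True)

# A very high similarity here suggests the encoder barely distinguishes
# the two compositions, one symptom of the conditioning problem.
print(torch.cosine_similarity(feats[0], feats[1], dim=0).item())
```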
arXiv Detail & Related papers (2024-06-12T03:21:34Z)
- Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval [65.43522019468976]
We propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes.
We develop an encoder-decoder network with a reconstruction task to distill high-level attribute-specific vectors without supervision.
Our models are equipped with a feature decorrelation constraint upon these attribute vectors to strengthen their representative abilities.
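A feature decorrelation constraint is typically implemented as a penalty on the off-diagonal entries of the Gram (or covariance) matrix of the attribute vectors. A generic sketch under that assumption, not necessarily the paper's exact constraint:

```python
import torch

def decorrelation_loss(attr_vecs: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal correlations between attribute-specific
    vectors so each captures a distinct factor.
    attr_vecs: (B, K, D) with K attribute vectors per sample."""
    v = attr_vecs / attr_vecs.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    gram = v @ v.transpose(1, 2)                    # (B, K, K)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.pow(2).mean()

loss = decorrelation_loss(torch.randn(4, 6, 128))
```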
arXiv Detail & Related papers (2023-11-21T08:20:38Z)
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attributes and objects).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
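CZSL systems commonly score an image against every attribute-object pair by composing the two primitive embeddings. The baseline-style sketch below illustrates that setup; it does not reproduce CoT's hierarchical visual experts.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Score image features against composed attribute+object embeddings
    (a generic CZSL baseline, not the CoT architecture)."""
    def __init__(self, n_attrs: int, n_objs: int, dim: int = 256):
        super().__init__()
        self.attr_emb = nn.Embedding(n_attrs, dim)
        self.obj_emb = nn.Embedding(n_objs, dim)
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, img_feat: torch.Tensor,
                pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (P, 2) holding (attr_id, obj_id) per candidate composition
        a = self.attr_emb(pairs[:, 0])
        o = self.obj_emb(pairs[:, 1])
        comp = self.compose(torch.cat([a, o], dim=-1))  # (P, dim)
        return img_feat @ comp.t()                      # (B, P) scores

# MIT-States has 115 attributes and 245 objects
scorer = PairScorer(n_attrs=115, n_objs=245)
scores = scorer(torch.randn(2, 256), torch.tensor([[0, 3], [1, 7]]))
```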
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
- Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image Generation [18.36261166580862]
Text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions.
Existing methods mainly extract text information from a single sentence to represent an image.
We propose an effective text representation method that complements sentence features with attribute information, sketched below.
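One way to complement a sentence embedding with attribute information is cross-attention over a learned attribute memory bank; this is our illustrative reading of "memory augmented", not the exact Adma-GAN module.

```python
import torch
import torch.nn as nn

class AttributeMemory(nn.Module):
    """Enrich a sentence embedding by attending over a bank of learned
    attribute memories (illustrative sketch, not the published module)."""
    def __init__(self, num_attrs: int = 100, dim: int = 256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_attrs, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        # sent_emb: (B, dim) used as a length-1 query over the memory
        q = sent_emb.unsqueeze(1)
        mem = self.memory.unsqueeze(0).expand(sent_emb.size(0), -1, -1)
        attended, _ = self.attn(q, mem, mem)
        return sent_emb + attended.squeeze(1)   # residual enrichment

out = AttributeMemory()(torch.randn(8, 256))
```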
arXiv Detail & Related papers (2022-09-28T12:28:54Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
A lack of compositionality can have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)