Related papers: ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

URL: http://arxiv.org/abs/2408.14339v1
Date: Mon, 26 Aug 2024 15:08:12 GMT
Title: ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty
Authors: Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, Sanjeev Arora
Abstract summary: ConceptMix is a scalable, controllable, and customizable benchmark. It automatically evaluates compositional generation ability of Text-to-Image (T2I) models. It reveals that the performance of several models, especially open models, drops dramatically with increased k.
Score: 52.15933752463479
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT4-o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that our ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.
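The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the concept vocabulary, the object list, the question wording, and the names `sample_concepts`, `make_questions`, and `count_satisfied` are all hypothetical, the GPT4-o prompt-writing step is omitted, and the VLM is replaced by a plain callable supplied by the caller.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical concept categories and values, for illustration only;
# the benchmark's actual vocabulary is not given in the abstract.
CONCEPTS: Dict[str, List[str]] = {
    "color": ["red", "blue", "green"],
    "shape": ["round", "square"],
    "spatial relationship": ["left of a tree", "on top of a table"],
}
OBJECTS: List[str] = ["dog", "car", "teapot"]

Concept = Tuple[str, str]  # (category, value)


def sample_concepts(k: int, rng: random.Random) -> Tuple[str, List[Concept]]:
    """Stage 1 (sampling part): pick one object and k visual concepts.

    Assumption: each concept comes from a distinct category, so k cannot
    exceed the number of categories.
    """
    if not 1 <= k <= len(CONCEPTS):
        raise ValueError("k must be between 1 and the number of categories")
    obj = rng.choice(OBJECTS)
    categories = rng.sample(list(CONCEPTS), k)
    return obj, [(cat, rng.choice(CONCEPTS[cat])) for cat in categories]


def make_questions(obj: str, concepts: List[Concept]) -> List[str]:
    """Stage 2 (question part): one yes/no question per sampled concept."""
    return [
        f"Does the {obj} in the image show the {cat} '{val}'?"
        for cat, val in concepts
    ]


def count_satisfied(questions: List[str], answer_fn: Callable[[str], bool]) -> int:
    """Count how many concept questions the grader answers 'yes' to.

    answer_fn stands in for the strong VLM; it is a hypothetical hook.
    """
    return sum(1 for q in questions if answer_fn(q))


if __name__ == "__main__":
    rng = random.Random(0)
    obj, concepts = sample_concepts(k=2, rng=rng)
    questions = make_questions(obj, concepts)
    # A dummy grader that says 'yes' to everything, only to show the flow.
    print(obj, concepts, count_satisfied(questions, lambda q: True))
```

Raising k in this sketch makes each prompt combine more concepts, which is the difficulty control the abstract refers to; how per-question answers are aggregated into a final benchmark score is not specified here and is left to the caller.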

Related papers

MC-LLaVA: Multi-Concept Personalized Vision-Language Model [51.645660375766575]
This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses.
arXiv Detail & Related papers (2025-03-24T16:32:17Z)
MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models [51.1034358143232]
We introduce component-controllable personalization, a novel task that pushes the boundaries of text-to-image (T2I) models. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics.
arXiv Detail & Related papers (2024-10-17T09:22:53Z)
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models [24.851041038347784]
Text-to-image (T2I) models are increasingly used in real-life applications. There is a growing need to audit these models to ensure that they generate desirable, task-appropriate images. We propose Concept2Concept, a framework where we characterize conditional distributions of vision language models.
arXiv Detail & Related papers (2024-10-06T21:42:53Z)
AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [3.5066393042242123]
We propose AttenCraft, an attention-based method for multiple-concept disentanglement. We introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts. Our model effectively mitigates two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.
arXiv Detail & Related papers (2024-05-28T08:50:14Z)
Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one. We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation [49.935634230341904]
We introduce the Multi-concept guidance for Multi-concept customization, termed MC$^2$, for improved flexibility and fidelity. MC$^2$ decouples the requirements for model architecture via inference time optimization. It adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words.
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition [47.07564907486087]
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
arXiv Detail & Related papers (2024-02-23T18:55:09Z)
Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images. Our method incorporates a novel cross-attention guidance to decompose multiple concepts. Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z)
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models [79.10890337599166]
We introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome.
arXiv Detail & Related papers (2023-06-07T18:00:38Z)
Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition. We propose augmenting the input image with masks that indicate the presence of target concepts. We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.