A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models
- URL: http://arxiv.org/abs/2503.13576v1
- Date: Mon, 17 Mar 2025 13:51:56 GMT
- Title: A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models
- Authors: Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, Zechao Li,
- Abstract summary: This paper categorizes existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination.<n>We identify key challenges and propose future research directions to propel this important and interesting field forward.
- Score: 23.538505578316204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.
Related papers
- OmniPrism: Learning Disentangled Visual Concept for Image Generation [57.21097864811521]
Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes.<n>We propose OmniPrism, a visual concept disentangling approach for creative image generation.<n>Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts.
arXiv Detail & Related papers (2024-12-16T18:59:52Z) - Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion [51.931083971448885]
We propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images.
Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
arXiv Detail & Related papers (2024-07-17T05:21:41Z) - ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction [20.43411883845885]
We introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts.
Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models.
We present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects.
arXiv Detail & Related papers (2024-07-09T17:50:28Z) - Conceptual Codebook Learning for Vision-Language Models [27.68834532978939]
We propose Codebook Learning (CoCoLe) to address the challenge of improving the generalization capability of vision-language models (VLMs)
We learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values.
We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities.
arXiv Detail & Related papers (2024-07-02T15:16:06Z) - Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs)
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z) - Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z) - Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z) - Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing
Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z) - Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present the Geom-Erasing, a novel concept removal method based on the geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z) - Visual Concepts Tokenization [65.61987357146997]
We propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens.
To obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens.
We further propose a Concept Disentangling Loss to facilitate that different concept tokens represent independent visual concepts.
arXiv Detail & Related papers (2022-05-20T11:25:31Z) - Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal
Frames [1.4502611532302037]
Social concepts referring to non-physical objects are powerful tools to describe, index, and query the content of visual data.
We propose a software approach to represent social concepts as multimodal frames, by integrating multisensory data.
Our method focuses on the extraction, analysis, and integration of multimodal features from visual art material tagged with the concepts of interest.
arXiv Detail & Related papers (2021-10-14T14:50:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.