Consistent Subject Generation via Contrastive Instantiated Concepts
- URL: http://arxiv.org/abs/2503.24387v1
- Date: Mon, 31 Mar 2025 17:59:51 GMT
- Title: Consistent Subject Generation via Contrastive Instantiated Concepts
- Authors: Lee Hsin-Ying, Kelvin C. K. Chan, Ming-Hsuan Yang
- Abstract summary: We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts.
- Score: 59.95616194326261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits their application in long-form content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate combinations of prompts and latent codes. Extensive evaluations of human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
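To make the mechanism above concrete, here is a minimal sketch of the two-part idea: a mapping network that turns a latent code into a pseudo-word embedding, and a contrastive objective over prompt/latent-code combinations. The module shapes, the InfoNCE-style loss, and names such as MappingNetwork and inject_pseudo_word are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch (not the authors' code): a mapping network turns a latent code
# into a pseudo-word embedding, and a contrastive loss encourages generations
# that share the same (prompt, latent code) pair to depict the same instance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    """Maps a latent code z to a pseudo-word embedding in the text-encoder space."""
    def __init__(self, latent_dim=512, text_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.SiLU(),
            nn.Linear(1024, text_dim),
        )

    def forward(self, z):
        return self.net(z)                      # (batch, text_dim)

def inject_pseudo_word(prompt_embeds, pseudo_word, position):
    """Place the pseudo-word at a placeholder position in the prompt embeddings."""
    out = prompt_embeds.clone()
    out[:, position] = pseudo_word
    return out

def contrastive_loss(features, pair_ids, temperature=0.07):
    """InfoNCE-style objective: samples sharing a (prompt, latent-code) pair are
    positives, every other combination is a negative."""
    features = F.normalize(features, dim=-1)
    logits = features @ features.t() / temperature
    logits.fill_diagonal_(float("-inf"))        # drop self-similarity
    positives = pair_ids.unsqueeze(0) == pair_ids.unsqueeze(1)
    positives.fill_diagonal_(False)
    log_prob = F.log_softmax(logits, dim=-1)
    return -log_prob[positives].mean()

# Reusing the same z (hence the same pseudo-word) keeps the subject consistent.
mapper = MappingNetwork()
z = torch.randn(2, 512)
pseudo_words = mapper(z)
```

In this reading, reusing a stored latent code reproduces the same pseudo-word, and hence the same subject, across otherwise independent prompts, while sampling a fresh code yields a new identity.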
Related papers
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations.
Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z)
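As a rough illustration of the discrete-plus-continuous encoding described in the UniToken entry above, the sketch below embeds VQ-style token indices, projects continuous patch features, and concatenates the two streams into one visual prefix. The dimensions, module names, and the concatenation choice are assumptions for illustration, not UniToken's architecture:

```python
# Hypothetical sketch: combine discrete (VQ-style) and continuous visual
# representations into one prefix for an autoregressive model.
import torch
import torch.nn as nn

class HybridVisualEncoder(nn.Module):
    def __init__(self, codebook_size=8192, feat_dim=1024, model_dim=2048):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, model_dim)   # discrete branch
        self.proj_continuous = nn.Linear(feat_dim, model_dim)    # continuous branch

    def forward(self, code_ids, patch_feats):
        """code_ids: (B, n_tokens) discrete indices from a VQ tokenizer.
        patch_feats: (B, n_tokens, feat_dim) continuous features, e.g. from a ViT."""
        discrete = self.codebook(code_ids)                  # high-level semantics
        continuous = self.proj_continuous(patch_feats)      # low-level details
        return torch.cat([discrete, continuous], dim=1)     # (B, 2*n_tokens, model_dim)

enc = HybridVisualEncoder()
ids = torch.randint(0, 8192, (1, 256))
feats = torch.randn(1, 256, 1024)
visual_prefix = enc(ids, feats)   # fed to the language model as visual context
```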
- DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis [3.6294581578004332]
We propose DynASyn, an effective multi-subject personalization method that works from a single reference image. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity.
arXiv Detail & Related papers (2025-03-22T10:56:35Z)
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat [51.78489661490396]
We introduce WeGen, a model that unifies multimodal generation and understanding. It can generate diverse results with high creativity for less detailed instructions. We show it achieves state-of-the-art performance across various visual generation benchmarks.
arXiv Detail & Related papers (2025-03-03T02:50:07Z)
- Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models [25.84386438333865]
We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We propose a novel method - MuCIL - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences.
arXiv Detail & Related papers (2025-02-27T18:59:29Z)
- Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation [39.93527514513576]
"Creative" remains an inherently abstract concept for both humans and diffusion models. Current methods rely heavily on reference prompts or images to achieve a creative effect. We introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, <CreTok>. Code will be made available at https://github.com/fu-feng/CreTok.
arXiv Detail & Related papers (2024-10-31T17:19:03Z)
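The token-redefinition step in the CreTok entry above can be pictured with a textual-inversion-style sketch: register a new token, grow the text encoder's embedding table, and initialize the new row from the word "creative". The Hugging Face calls are standard, but the initialization and training details are my assumptions, not the code from the authors' repository:

```python
# Hypothetical sketch of redefining "creative" as a learnable token <CreTok>,
# in the spirit of textual inversion (not the CreTok authors' code).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the new token and grow the embedding table to hold it.
tokenizer.add_tokens(["<CreTok>"])
text_encoder.resize_token_embeddings(len(tokenizer))
cretok_id = tokenizer.convert_tokens_to_ids("<CreTok>")

# Initialize <CreTok> from the existing "creative" embedding; in training, only
# this row would be updated while the rest of the text encoder stays frozen.
with torch.no_grad():
    emb = text_encoder.get_input_embeddings().weight
    creative_id = tokenizer("creative", add_special_tokens=False).input_ids[0]
    emb[cretok_id] = emb[creative_id].clone()

# Prompts can then use the token directly, e.g. "a <CreTok> teapot".
ids = tokenizer("a <CreTok> teapot", return_tensors="pt").input_ids
```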
- Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation [84.45144851024257]
We propose a novel framework that aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes. The core idea is to map users and items into discrete codes rich in collaborative information for reliable and informative contrastive view generation.
arXiv Detail & Related papers (2024-09-09T14:04:17Z)
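A small sketch of the discrete-code idea in the entry above: embeddings are snapped to their nearest codebook entries, and the resulting code embeddings act as a second view for an InfoNCE objective. The codebook size, nearest-neighbour assignment, and function names are assumptions, not the paper's construction:

```python
# Hypothetical sketch: quantize user/item embeddings into discrete codes and use
# the code embeddings as an alternative "view" for contrastive learning.
import torch
import torch.nn.functional as F

def quantize(embeddings, codebook):
    """Assign each embedding to its nearest code and return the code embeddings."""
    dists = torch.cdist(embeddings, codebook)          # (N, K) pairwise distances
    codes = dists.argmin(dim=-1)                       # discrete code per embedding
    return codebook[codes], codes

def info_nce(view_a, view_b, temperature=0.2):
    """Each embedding and its quantized view form a positive pair."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

users = torch.randn(64, 128)                           # user embeddings
codebook = torch.randn(256, 128)                       # learned discrete codes
quantized, codes = quantize(users, codebook)
loss = info_nce(users, quantized)
```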
- Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis [14.21719970175159]
Concept Conductor is designed to ensure visual fidelity and correct layout in multi-concept customization.
We present a concept injection technique that employs shape-aware masks to specify the generation area for each concept.
Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts.
arXiv Detail & Related papers (2024-08-07T08:43:58Z)
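The shape-aware masking idea in the Concept Conductor entry above can be sketched as mask-weighted blending of per-concept denoiser predictions, with a base model filling the background. The blending rule and tensor shapes are illustrative assumptions, not the paper's concept-injection implementation:

```python
# Hypothetical sketch of mask-based concept injection: each personalized concept
# contributes to the denoising prediction only inside its shape-aware mask.
import torch

def masked_concept_injection(noise_preds, masks, base_pred):
    """noise_preds: list of (B, C, H, W) predictions, one per concept model.
    masks: list of (B, 1, H, W) soft masks marking each concept's region.
    base_pred: prediction from the base model for the background."""
    coverage = torch.zeros_like(masks[0])
    merged = torch.zeros_like(base_pred)
    for pred, mask in zip(noise_preds, masks):
        merged += mask * pred                               # concept controls only its own area
        coverage += mask
    merged += (1.0 - coverage.clamp(max=1.0)) * base_pred   # background from the base model
    return merged

# Usage with two concepts at 64x64 latent resolution:
B, C, H, W = 1, 4, 64, 64
preds = [torch.randn(B, C, H, W) for _ in range(2)]
masks = [torch.rand(B, 1, H, W) for _ in range(2)]
base = torch.randn(B, C, H, W)
out = masked_concept_injection(preds, masks, base)
```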
- MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation [59.00909718832648]
We propose MC$^2$, a novel approach for multi-concept customization. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts. Experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment.
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
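The attention-refinement idea in the MC$^2$ entry above can be sketched as adding a region-dependent bias to the cross-attention scores of each concept's text tokens. The additive-bias rule, shapes, and names are assumptions for illustration, not MC$^2$'s adaptive refinement:

```python
# Hypothetical sketch of refining cross-attention so that image regions attend
# more strongly to the text tokens of their associated concept.
import torch
import torch.nn.functional as F

def refine_cross_attention(attn_logits, concept_token_ids, region_masks, bias=2.0):
    """attn_logits: (B, heads, num_image_tokens, num_text_tokens) pre-softmax scores.
    concept_token_ids: list of index tensors, text tokens belonging to each concept.
    region_masks: list of (B, num_image_tokens) float masks, one region per concept."""
    logits = attn_logits.clone()
    for token_ids, region in zip(concept_token_ids, region_masks):
        # Bias the concept's tokens upward inside its region so the softmax
        # shifts attention mass there.
        logits[..., token_ids] += bias * region[:, None, :, None]
    return F.softmax(logits, dim=-1)

# Usage: two concepts, 77 text tokens, 4096 image tokens.
B, H, I, T = 1, 8, 4096, 77
logits = torch.randn(B, H, I, T)
ids = [torch.tensor([5, 6]), torch.tensor([12, 13, 14])]
regions = [(torch.rand(B, I) > 0.5).float(), (torch.rand(B, I) > 0.5).float()]
probs = refine_cross_attention(logits, ids, regions)
```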
- ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints [56.824187892204314]
We present the task of creative text-to-image generation, where we seek to generate new members of a broad category.
We show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior.
We incorporate a question-answering Vision-Language Model (VLM) that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly unique creations.
arXiv Detail & Related papers (2023-08-03T17:04:41Z)
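The prior-space optimization described in the ConceptLab entry above can be pictured as a loop that pulls a learnable embedding toward a broad category and away from its known members, with a VLM periodically naming new members to add as constraints. Everything below (the mocked text encoder, the cosine losses, the loop) is a self-contained, assumption-laden sketch, not the paper's pipeline, which operates over a diffusion prior with a real VLM:

```python
# Hypothetical sketch of ConceptLab-style constrained optimization: stay close to
# a broad category but away from known members; the VLM query is only a comment.
import torch
import torch.nn.functional as F

def constraint_loss(concept_emb, positive_emb, negative_embs, margin=0.3):
    """Pull toward the category embedding, push away from each named member."""
    pos = 1.0 - F.cosine_similarity(concept_emb, positive_emb, dim=-1)
    neg = torch.stack([
        F.relu(F.cosine_similarity(concept_emb, n, dim=-1) - margin)
        for n in negative_embs
    ]).mean()
    return pos.mean() + neg

def encode_text(prompt):
    """Mock text encoder (stand-in for a real prior/text encoder)."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return F.normalize(torch.randn(1, 768), dim=-1)

positive = encode_text("a photo of a pet")
negatives = [encode_text(f"a photo of a {m}") for m in ["cat", "dog", "hamster"]]

concept = torch.nn.Parameter(F.normalize(torch.randn(1, 768), dim=-1))
opt = torch.optim.Adam([concept], lr=1e-2)
for step in range(100):
    opt.zero_grad()
    loss = constraint_loss(concept, positive, negatives)
    loss.backward()
    opt.step()
    # In the paper's setting, a VLM is periodically asked what the current result
    # resembles, and the answer is appended to `negatives` as a new constraint.
```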
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
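As a rough picture of tuning-free customization of the kind described above, the sketch below projects frozen image-encoder features into a single pseudo-token that replaces a placeholder in the prompt embeddings. The encoder feature size, projector, and placeholder position are assumptions, not the paper's architecture:

```python
# Hypothetical sketch: a frozen image encoder plus a small projector turn a
# reference photo into a pseudo-token, so no per-subject fine-tuning is needed.
import torch
import torch.nn as nn

class SubjectProjector(nn.Module):
    def __init__(self, image_feat_dim=1024, text_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(image_feat_dim, 1024), nn.GELU(),
            nn.Linear(1024, text_dim),
        )

    def forward(self, image_features):
        return self.proj(image_features)        # one pseudo-token per reference image

def condition_prompt(prompt_embeds, subject_token, position):
    """Swap a placeholder token (e.g. for "S*") with the subject pseudo-token."""
    out = prompt_embeds.clone()
    out[:, position] = subject_token
    return out

projector = SubjectProjector()
image_features = torch.randn(1, 1024)           # e.g. pooled features from a frozen image encoder
subject_token = projector(image_features)
prompt_embeds = torch.randn(1, 77, 768)         # text-encoder output for the prompt
conditioned = condition_prompt(prompt_embeds, subject_token, position=5)
```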
This list is automatically generated from the titles and abstracts of the papers listed on this site.
The site does not guarantee the accuracy of this information and accepts no responsibility for any consequences of its use.