Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion
- URL: http://arxiv.org/abs/2508.07755v1
- Date: Mon, 11 Aug 2025 08:36:29 GMT
- Title: Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion
- Authors: Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
- Abstract summary: We propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
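To make the mechanism concrete, the following is a minimal, self-contained PyTorch sketch of a contrastive token-inversion objective in the spirit of the abstract above. It is an illustration under stated assumptions, not the authors' implementation: the random `image_feats` stand in for features from a frozen encoder, and all names, dimensions, and loss terms are hypothetical.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: in the paper's setting these would come from a frozen
# text/image encoder pair. Random features keep the sketch self-contained.
num_images, dim = 4, 768
image_feats = F.normalize(torch.randn(num_images, dim), dim=-1)

# One shared target token (the common concept) plus one auxiliary token
# per input image (image-specific residuals), all learnable.
target_token = torch.nn.Parameter(torch.randn(dim))
aux_tokens = torch.nn.Parameter(torch.randn(num_images, dim))

optimizer = torch.optim.Adam([target_token, aux_tokens], lr=1e-3)
temperature = 0.07
labels = torch.arange(num_images)

for step in range(100):
    t = F.normalize(target_token, dim=-1)
    a = F.normalize(aux_tokens, dim=-1)

    # The target token should agree with every image, since the common
    # concept appears in all of them.
    loss_common = -(image_feats @ t).mean()

    # InfoNCE-style term: each auxiliary token should explain only its
    # own image, i.e. be the positive among all auxiliary tokens.
    aux_sim = image_feats @ a.T  # (num_images, num_images)
    loss_aux = F.cross_entropy(aux_sim / temperature, labels)

    # Disentanglement pressure (assumed): keep the shared token and the
    # per-image tokens from encoding the same directions.
    loss_disent = (a @ t).pow(2).mean()

    loss = loss_common + loss_aux + loss_disent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")
```

In a real pipeline the learned `target_token` would be injected into the text encoder as a pseudo-word (as in Textual Inversion-style methods), while the per-image auxiliary tokens are discarded after training.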
Related papers
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization [11.472088067393074]
ConceptPrism is a novel framework that automatically disentangles the shared visual concept from image-specific residuals. In experiments, ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.
arXiv Detail & Related papers (2026-02-23T07:46:19Z)
- Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models [9.94436942959918]
A text-to-image diffusion model learns a new visual concept from a limited number of reference images. We propose semantic anchoring, which guides adaptation by grounding new concepts in their corresponding distributions. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions.
arXiv Detail & Related papers (2025-11-27T09:16:33Z)
- AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment [74.47138661595584]
We propose AlignGen, a Cross-Modality Prior Alignment mechanism for personalized image generation. We show that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
arXiv Detail & Related papers (2025-05-28T02:57:55Z)
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information to guide the diffusion process, generating content-rich images with enhanced fidelity and complexity. A toy sketch of the two constraints follows this entry.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
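The two constraints described in the entry above lend themselves to a toy illustration over attention maps. This is a hypothetical sketch, not the paper's implementation: `cross_attn` and `self_attn` are random stand-ins for a diffusion model's attention, and both constraint formulations are assumptions.

```python
import torch

# Toy stand-ins for one denoising step's attention maps.
# cross_attn[p, t]: attention of pixel p to text token t.
# self_attn[p, q]:  attention of pixel p to pixel q.
num_pixels, num_tokens = 32 * 32, 4
cross_attn = torch.rand(num_pixels, num_tokens).softmax(dim=-1)
self_attn = torch.rand(num_pixels, num_pixels).softmax(dim=-1)

# Binary layout masks: region[p, t] = 1 if pixel p lies inside the box
# assigned to token t (two boxes here, top and bottom halves).
region = torch.zeros(num_pixels, num_tokens)
region[: num_pixels // 2, 0] = 1.0
region[num_pixels // 2 :, 1] = 1.0

# 1) Inter-token constraint (assumed form): penalize attention mass a
#    token places outside its own region, so tokens stop competing for
#    the same pixels.
outside = cross_attn[:, :2] * (1.0 - region[:, :2])
inter_token_loss = outside.sum(dim=0).mean()

# 2) Self-attention constraint (assumed form): encourage pixels of a
#    region to keep their attention mass inside that region.
mask0 = region[:, 0].bool()
within = self_attn[mask0][:, mask0].sum(dim=-1)
self_attn_loss = (1.0 - within).mean()

loss = inter_token_loss + self_attn_loss
print(f"inter-token: {inter_token_loss.item():.4f}  self-attn: {self_attn_loss.item():.4f}")
```

In an actual training-free sampler, penalties like these are typically backpropagated to the noisy latent at each denoising step rather than to any model weights.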
- Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
- Conditional Score Guidance for Text-Driven Image-to-Image Translation [52.73564644268749]
We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model.
Our method aims to generate a target image by selectively editing the regions of interest in a source image. An illustrative sketch of this idea follows this entry.
arXiv Detail & Related papers (2023-05-29T10:48:34Z)
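Selective region editing of this kind can be sketched as blending two noise predictions under a region-of-interest mask. The following is a hypothetical illustration with a dummy denoiser; the mask source and the exact score combination are assumptions, not the paper's algorithm.

```python
import torch

def dummy_denoiser(latent: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
    """Stand-in for a diffusion model's noise prediction."""
    return latent * 0.1 + prompt_emb.mean() * 0.01

latent = torch.randn(1, 4, 64, 64)   # current noisy latent
src_emb = torch.randn(77, 768)       # source-prompt embedding (stand-in)
tgt_emb = torch.randn(77, 768)       # target-prompt embedding (stand-in)

# Region-of-interest mask (here a hard-coded square; in practice it might
# be derived from cross-attention maps or supplied by the user).
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0

# Predict scores under both prompts, then blend: edit inside the mask,
# preserve the source prediction elsewhere (assumed form of the guidance).
eps_src = dummy_denoiser(latent, src_emb)
eps_tgt = dummy_denoiser(latent, tgt_emb)
eps = mask * eps_tgt + (1.0 - mask) * eps_src

print(eps.shape)  # torch.Size([1, 4, 64, 64])
```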
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process. A toy sketch of a masked objective in this spirit follows this entry.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
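A plausible reading of the mask-augmented objective above is a diffusion reconstruction loss restricted to the pixels each concept occupies. Below is a toy PyTorch sketch under that assumption; the tensors are random stand-ins and the loss form is illustrative, not the paper's exact two-phase procedure.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for one training step on a single image.
noise = torch.randn(1, 4, 64, 64)       # ground-truth noise added at step t
noise_pred = torch.randn(1, 4, 64, 64)  # model's noise prediction (stand-in)

# One binary mask per target concept in the scene (two concepts here).
masks = torch.zeros(2, 1, 64, 64)
masks[0, ..., :32, :] = 1.0  # concept 0 occupies the top half
masks[1, ..., 32:, :] = 1.0  # concept 1 occupies the bottom half

# Masked diffusion loss (assumed form): only pixels covered by some target
# concept contribute, so the learned tokens are not penalized for background.
union = masks.sum(dim=0).clamp(max=1.0)                      # (1, 64, 64)
per_pixel = F.mse_loss(noise_pred, noise, reduction="none")  # (1, 4, 64, 64)
loss = (per_pixel * union).sum() / union.sum().clamp(min=1.0) / per_pixel.shape[1]

print(f"masked loss: {loss.item():.4f}")
```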
- DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter [63.622879199281705]
Some example-based image generation approaches have been proposed, i.e., generating new concepts by absorbing the salient features of a few input references. We propose a simple yet effective framework, namely DreamArtist, which adopts a novel positive-negative prompt-tuning learning strategy on the pre-trained diffusion model. We have conducted extensive experiments evaluating the proposed method in terms of image similarity (fidelity) and diversity, generation controllability, and style cloning. A toy sketch of the positive-negative strategy follows this entry.
arXiv Detail & Related papers (2022-11-21T10:37:56Z)
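The positive-negative prompt-tuning strategy suggests learning two pseudo-word embeddings and composing their predictions in a classifier-free-guidance style. The sketch below is a hypothetical toy: the denoiser is a stand-in, the training target is random, and the composition rule is an assumption rather than DreamArtist's published formulation.

```python
import torch

def dummy_denoiser(latent: torch.Tensor, token: torch.Tensor) -> torch.Tensor:
    """Stand-in for a frozen diffusion model conditioned on one token."""
    return latent * 0.1 + token.mean() * 0.01

# Two learnable pseudo-word embeddings: the positive token captures what
# the reference image is; the negative token absorbs what it is not,
# rectifying deficiencies of the positive one.
dim = 768
pos_token = torch.nn.Parameter(torch.randn(dim))
neg_token = torch.nn.Parameter(torch.randn(dim))
optimizer = torch.optim.Adam([pos_token, neg_token], lr=1e-3)

latent = torch.randn(1, 4, 64, 64)
target_noise = torch.randn(1, 4, 64, 64)  # stand-in training target
guidance_scale = 5.0

for step in range(50):
    eps_pos = dummy_denoiser(latent, pos_token)
    eps_neg = dummy_denoiser(latent, neg_token)
    # Classifier-free-guidance-style composition (assumed form):
    # push away from the negative token, toward the positive one.
    eps = eps_neg + guidance_scale * (eps_pos - eps_neg)
    loss = torch.nn.functional.mse_loss(eps, target_noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")
```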