Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
- URL: http://arxiv.org/abs/2506.24085v2
- Date: Mon, 14 Jul 2025 13:42:45 GMT
- Title: Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
- Authors: Wonwoong Cho, Yanxia Zhang, Yan-Ying Chen, David I. Inouye
- Abstract summary: Cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation. We propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity.
- Score: 11.686174382596667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter, "IT-Blender", that can automate the blending process to enhance human creativity. Prior work on cross-modal conceptual blending is limited either in encoding a real image without loss of detail or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of detail and blends the visual concept with the object specified by the text in a disentangled way. Our experimental results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on a new application of image generative models to augment human creativity.
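The abstract describes the core mechanism only at a high level, so the snippet below is a minimal PyTorch-style sketch of what a "blended attention" layer of this kind could look like: queries come from the noisy generated latent, while keys and values are drawn from both the noisy latent and the clean reference image's features. The module name, the separate key/value projection for the reference branch, and the blend weight `alpha` are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a "blended attention" layer in the spirit described by the
# abstract: queries come from the noisy generated latent, while keys/values are
# drawn from BOTH the noisy latent and the clean reference image's features.
# Module name, head layout, and the blend weight `alpha` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlendedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, alpha: float = 0.5):
        super().__init__()
        self.num_heads = num_heads
        self.alpha = alpha  # hypothetical knob for how strongly the reference is mixed in
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)      # keys/values for the noisy branch
        self.to_kv_ref = nn.Linear(dim, 2 * dim, bias=False)  # keys/values for the clean reference
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, C) features of the noisy generated latent
        # ref: (B, M, C) features of the clean reference image latent
        B, N, C = x.shape
        h = self.num_heads

        def split_heads(t):
            return t.view(B, -1, h, C // h).transpose(1, 2)  # (B, h, L, C/h)

        q = split_heads(self.to_q(x))
        k_x, v_x = self.to_kv(x).chunk(2, dim=-1)
        k_r, v_r = self.to_kv_ref(ref).chunk(2, dim=-1)

        # Blend by letting the noisy tokens attend jointly to themselves and to
        # the reference tokens; the reference values are scaled by alpha.
        k = torch.cat([split_heads(k_x), split_heads(k_r)], dim=2)
        v = torch.cat([split_heads(v_x), self.alpha * split_heads(v_r)], dim=2)

        out = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, C/h)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.to_out(out)
```
In an adapter setting, a layer like this would typically sit alongside the frozen self-attention blocks of SD or FLUX, with only the new projections being trained.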
Related papers
- Blending Concepts with Text-to-Image Diffusion Models [48.68800153838679]
Diffusion models have advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease.
In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework.
We show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning.
arXiv Detail & Related papers (2025-06-30T08:53:30Z)
- IP-Composer: Semantic Composition of Visual Concepts [49.18472621931207]
We present IP-Composer, a training-free approach for compositional image generation.
Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding.
We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. (A rough sketch of this embedding-composition idea follows the related-papers list below.)
arXiv Detail & Related papers (2025-02-19T18:49:31Z)
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talent in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- OmniPrism: Learning Disentangled Visual Concept for Image Generation [57.21097864811521]
Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes.
We propose OmniPrism, a visual concept disentangling approach for creative image generation.
Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts.
arXiv Detail & Related papers (2024-12-16T18:59:52Z)
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
- Financial Models in Generative Art: Black-Scholes-Inspired Concept Blending in Text-to-Image Diffusion [57.03116054807942]
We introduce a novel approach for concept blending in pretrained text-to-image diffusion models.
We derive a robust algorithm for concept blending that capitalizes on the Markovian dynamics of the Black-Scholes framework.
Our work shows that financially inspired techniques can enhance text-to-image concept blending in generative AI.
arXiv Detail & Related papers (2024-05-22T14:25:57Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- DiffMorph: Text-less Image Morphing with Diffusion Models [0.0]
DiffMorph synthesizes images that mix concepts without the use of textual prompts.
DiffMorph takes an initial image with conditioning artist-drawn sketches to generate a morphed image.
We employ a pre-trained text-to-image diffusion model and fine-tune it to reconstruct each image faithfully.
arXiv Detail & Related papers (2024-01-01T12:42:32Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
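Several of the related entries above compose conditioning signals directly in embedding space rather than retraining the model; the IP-Composer summary, for example, describes stitching composite embeddings from projections of input images onto concept-specific CLIP subspaces identified through text. The snippet below is a rough, hedged sketch of that general idea using precomputed CLIP embeddings; the SVD-based subspace construction and the component-swap rule are illustrative assumptions, not IP-Composer's exact procedure.
```python
# Rough sketch of composing CLIP image embeddings: identify a concept-specific
# subspace from text embeddings, then replace the reference image's component
# in that subspace with the concept image's component. The SVD construction and
# the replacement rule are assumptions for illustration only.
import torch


def concept_subspace(text_embeds: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """Return an orthonormal basis (D, rank) spanning the dominant directions
    of a set of text embeddings (K, D) that describe variations of one concept."""
    centered = text_embeds - text_embeds.mean(dim=0, keepdim=True)
    # Right singular vectors give directions in embedding space.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank].T  # (D, rank)


def compose(e_ref: torch.Tensor, e_concept: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Swap the concept-subspace component of the reference embedding for the
    concept image's component; everything outside the subspace stays intact."""
    proj = basis @ basis.T  # (D, D) projector onto the concept subspace
    return e_ref - e_ref @ proj + e_concept @ proj


# Hypothetical usage with precomputed CLIP embeddings (shapes: (D,) and (K, D)):
# basis = concept_subspace(text_variation_embeds, rank=8)
# composite = compose(reference_image_embed, pattern_image_embed, basis)
# `composite` could then condition an IP-Adapter-style image branch in place of
# a single image embedding.
```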
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.