Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
- URL: http://arxiv.org/abs/2506.24085v2
- Date: Mon, 14 Jul 2025 13:42:45 GMT
- Title: Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
- Authors: Wonwoong Cho, Yanxia Zhang, Yan-Ying Chen, David I. Inouye
- Abstract summary: Cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation. We propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity.
- Score: 11.686174382596667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter, "IT-Blender", that can automate the blending process to enhance human creativity. Prior work on cross-modal conceptual blending is limited either in encoding a real image without loss of detail or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of detail and blends the visual concept with the object specified by the text in a disentangled way. Our experimental results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on a new application of image generative models to augment human creativity.
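The abstract describes the core mechanism only at a high level, so the snippet below is a minimal PyTorch-style sketch of what a "blended attention" layer of this kind could look like: queries come from the noisy generated latent, while keys and values are drawn from both the noisy latent and the clean reference image's features. The module name, the separate key/value projection for the reference branch, and the blend weight `alpha` are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a "blended attention" layer in the spirit described by the
# abstract: queries come from the noisy generated latent, while keys/values are
# drawn from BOTH the noisy latent and the clean reference image's features.
# Module name, head layout, and the blend weight `alpha` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlendedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, alpha: float = 0.5):
        super().__init__()
        self.num_heads = num_heads
        self.alpha = alpha  # hypothetical knob for how strongly the reference is mixed in
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)      # keys/values for the noisy branch
        self.to_kv_ref = nn.Linear(dim, 2 * dim, bias=False)  # keys/values for the clean reference
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, C) features of the noisy generated latent
        # ref: (B, M, C) features of the clean reference image latent
        B, N, C = x.shape
        h = self.num_heads

        def split_heads(t):
            return t.view(B, -1, h, C // h).transpose(1, 2)  # (B, h, L, C/h)

        q = split_heads(self.to_q(x))
        k_x, v_x = self.to_kv(x).chunk(2, dim=-1)
        k_r, v_r = self.to_kv_ref(ref).chunk(2, dim=-1)

        # Blend by letting the noisy tokens attend jointly to themselves and to
        # the reference tokens; the reference values are scaled by alpha.
        k = torch.cat([split_heads(k_x), split_heads(k_r)], dim=2)
        v = torch.cat([split_heads(v_x), self.alpha * split_heads(v_r)], dim=2)

        out = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, C/h)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.to_out(out)
```
In an adapter setting, a layer like this would typically sit alongside the frozen self-attention blocks of SD or FLUX, with only the new projections being trained.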
Related papers
- Blending Concepts with Text-to-Image Diffusion Models [48.68800153838679]
Diffusion models have advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease.
In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework.
We show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning.
arXiv Detail & Related papers (2025-06-30T08:53:30Z)
- IP-Composer: Semantic Composition of Visual Concepts [49.18472621931207]
We present IP-Composer, a training-free approach for compositional image generation.
Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding.
We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. (A rough sketch of this embedding-composition idea follows the related-papers list below.)
arXiv Detail & Related papers (2025-02-19T18:49:31Z)
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control [8.685610154314459]
Diffusion models show extraordinary talent in text-to-image generation, but they may still fail to generate highly aesthetic images.
We propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method.
arXiv Detail & Related papers (2024-12-30T08:47:25Z)
- OmniPrism: Learning Disentangled Visual Concept for Image Generation [57.21097864811521]
Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes.
We propose OmniPrism, a visual concept disentangling approach for creative image generation.
Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts.
arXiv Detail & Related papers (2024-12-16T18:59:52Z)
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
- Financial Models in Generative Art: Black-Scholes-Inspired Concept Blending in Text-to-Image Diffusion [57.03116054807942]
We introduce a novel approach for concept blending in pretrained text-to-image diffusion models.
We derive a robust algorithm for concept blending that capitalizes on the Markovian dynamics of the Black-Scholes framework.
Our work shows that financially inspired techniques can enhance text-to-image concept blending in generative AI.
arXiv Detail & Related papers (2024-05-22T14:25:57Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- DiffMorph: Text-less Image Morphing with Diffusion Models [0.0]
DiffMorph synthesizes images that mix concepts without the use of textual prompts.
DiffMorph takes an initial image with conditioning artist-drawn sketches to generate a morphed image.
We employ a pre-trained text-to-image diffusion model and fine-tune it to reconstruct each image faithfully.
arXiv Detail & Related papers (2024-01-01T12:42:32Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
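Several of the related entries above compose conditioning signals directly in embedding space rather than retraining the model; the IP-Composer summary, for example, describes stitching composite embeddings from projections of input images onto concept-specific CLIP subspaces identified through text. The snippet below is a rough, hedged sketch of that general idea using precomputed CLIP embeddings; the SVD-based subspace construction and the component-swap rule are illustrative assumptions, not IP-Composer's exact procedure.
```python
# Rough sketch of composing CLIP image embeddings: identify a concept-specific
# subspace from text embeddings, then replace the reference image's component
# in that subspace with the concept image's component. The SVD construction and
# the replacement rule are assumptions for illustration only.
import torch


def concept_subspace(text_embeds: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """Return an orthonormal basis (D, rank) spanning the dominant directions
    of a set of text embeddings (K, D) that describe variations of one concept."""
    centered = text_embeds - text_embeds.mean(dim=0, keepdim=True)
    # Right singular vectors give directions in embedding space.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank].T  # (D, rank)


def compose(e_ref: torch.Tensor, e_concept: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Swap the concept-subspace component of the reference embedding for the
    concept image's component; everything outside the subspace stays intact."""
    proj = basis @ basis.T  # (D, D) projector onto the concept subspace
    return e_ref - e_ref @ proj + e_concept @ proj


# Hypothetical usage with precomputed CLIP embeddings (shapes: (D,) and (K, D)):
# basis = concept_subspace(text_variation_embeds, rank=8)
# composite = compose(reference_image_embed, pattern_image_embed, basis)
# `composite` could then condition an IP-Adapter-style image branch in place of
# a single image embedding.
```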
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.