Piece it Together: Part-Based Concepting with IP-Priors
- URL: http://arxiv.org/abs/2503.10365v1
- Date: Thu, 13 Mar 2025 13:46:10 GMT
- Title: Piece it Together: Part-Based Concepting with IP-Priors
- Authors: Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
- Abstract summary: We introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+. We also present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task.
- Score: 52.01640707131325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
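As a rough, hypothetical sketch of the kind of component the abstract describes (a lightweight flow-matching prior trained over a frozen image-embedding space), the snippet below trains a small velocity-prediction network conditioned on a partial set of "part" embeddings. The embedding dimension, token count, architecture, and masking scheme are all illustrative assumptions, not the authors' IP-Prior implementation.

```python
# Minimal sketch (assumptions, not the paper's code): a flow-matching prior
# over fixed image-embedding tokens, conditioned on a partial set of
# user-provided "part" embeddings. Dimensions and architecture are made up.
import torch
import torch.nn as nn

EMB_DIM = 1024      # assumed dimensionality of IP-Adapter+-style tokens
N_TOKENS = 16       # assumed number of tokens per concept

class FlowPrior(nn.Module):
    """Predicts the velocity field v(x_t, t, cond) for flow matching."""
    def __init__(self, dim=EMB_DIM, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, cond):
        # x_t, cond: (B, N_TOKENS, dim); t: (B, 1)
        t = t[:, None, :].expand(-1, x_t.shape[1], -1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, part_mask):
    """x1: clean embeddings (B, N, D); part_mask: 1 where a part is given."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.shape[0], 1, device=x1.device)
    x_t = (1 - t[:, :, None]) * x0 + t[:, :, None] * x1  # linear path
    target_v = x1 - x0                                   # constant-velocity target
    cond = x1 * part_mask                                # expose only given parts
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# toy usage with random tensors standing in for real embeddings
model = FlowPrior()
x1 = torch.randn(8, N_TOKENS, EMB_DIM)
mask = (torch.rand(8, N_TOKENS, 1) < 0.5).float()
loss = flow_matching_loss(model, x1, mask)
loss.backward()
```

At inference time, one would integrate the learned velocity field from Gaussian noise to a complete embedding while conditioning on the user-provided part embeddings, then decode the result with IP-Adapter+; this sampling step is omitted here for brevity.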
Related papers
- Zero-Shot Visual Concept Blending Without Text Guidance [0.0]
"Visual Concept Blending" provides fine-grained control over which features from multiple reference images are transferred to a source image.
Our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations.
arXiv Detail & Related papers (2025-03-27T08:56:33Z) - Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
arXiv Detail & Related papers (2025-02-19T21:30:51Z) - IP-Composer: Semantic Composition of Visual Concepts [49.18472621931207]
We present IP-Composer, a training-free approach for compositional image generation. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text.
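For intuition, here is a minimal sketch of the subspace-projection idea summarized above: build a concept subspace from a set of text embeddings, then swap the base image's component in that subspace for the reference image's. The shapes, the rank, and the prompt set are assumptions; this is not IP-Composer's actual code.

```python
# Minimal sketch (assumed shapes, not the IP-Composer implementation):
# replace one concept's component of a base image's CLIP embedding with
# that component taken from a reference image.
import torch

def concept_basis(text_embeds: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """text_embeds: (M, D) CLIP text embeddings describing concept variations.
    Returns an orthonormal basis (rank, D) spanning the concept subspace."""
    # top right-singular vectors approximate the concept-specific directions
    _, _, vh = torch.linalg.svd(text_embeds, full_matrices=False)
    return vh[:rank]

def compose(base: torch.Tensor, ref: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """base, ref: (D,) CLIP image embeddings; basis: (rank, D).
    Swap the base's projection onto the concept subspace for the ref's."""
    proj = basis.T @ basis                 # (D, D) projector onto the subspace
    return base - proj @ base + proj @ ref

# toy usage with random stand-ins for real CLIP embeddings
D = 768
texts = torch.randn(32, D)   # e.g. embeddings of prompts describing hairstyles
basis = concept_basis(texts)
composite = compose(torch.randn(D), torch.randn(D), basis)
```

The composite embedding would then be fed to the IP-Adapter conditioning pathway in place of a single image's embedding.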
arXiv Detail & Related papers (2025-02-19T18:49:31Z) - ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z) - PartCraft: Crafting Creative Objects by Parts [128.30514851911218]
This paper propels creative control in generative visual AI: for the first time, users can "select" visual concepts by parts for their creative endeavors.
The result is fine-grained generation that precisely captures the selected visual concepts.
arXiv Detail & Related papers (2024-07-05T15:53:04Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - Unsupervised Learning of Compositional Energy Concepts [70.11673173291426]
We propose COMET, which discovers and represents concepts as separate energy functions.
COMET represents both global concepts and objects under a unified framework.
arXiv Detail & Related papers (2021-11-04T17:46:12Z) - SAFCAR: Structured Attention Fusion for Compositional Action Recognition [47.43959215267547]
We develop and test a novel Structured Attention Fusion (SAF) self-attention mechanism to combine information from object detections.
We show that our approach recognizes novel verb-noun compositions more effectively than current state-of-the-art systems.
We validate our approach on the challenging Something-Else tasks from the Something-Something-V2 dataset.
arXiv Detail & Related papers (2020-12-03T17:45:01Z) - Visual Concept Reasoning Networks [93.99840807973546]
A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks.
We propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts.
Our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
arXiv Detail & Related papers (2020-08-26T20:02:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.