IC-Custom: Diverse Image Customization via In-Context Learning
- URL: http://arxiv.org/abs/2507.01926v1
- Date: Wed, 02 Jul 2025 17:36:38 GMT
- Title: IC-Custom: Diverse Image Customization via In-Context Learning
- Authors: Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan
- Abstract summary: IC-Custom is a unified framework that seamlessly integrates position-aware and position-free image customization. It supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. It achieves 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters.
- Score: 72.92059781700594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples, with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
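The abstract describes enough of the mechanism to sketch the idea: reference and target tokens are concatenated into one sequence (the polyptych), learnable task-oriented register tokens tell the model which task type it is handling, and a boundary-aware positional signal distinguishes the panels. Below is a minimal, hypothetical PyTorch sketch of such an attention block; all class and parameter names are assumptions, and the paper's actual ICMA layer inside DiT will differ.

```python
import torch
import torch.nn as nn

class ICMASketch(nn.Module):
    """Hypothetical sketch of In-context Multi-Modal Attention (ICMA).

    Assumptions (not from the paper): a single self-attention layer, one bank
    of learnable register tokens per task type, and a learned per-panel
    embedding standing in for the boundary-aware positional embeddings.
    """

    def __init__(self, dim=64, heads=4, num_tasks=2, num_registers=4, num_panels=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.registers = nn.Parameter(torch.randn(num_tasks, num_registers, dim))
        self.panel_emb = nn.Embedding(num_panels, dim)

    def forward(self, ref_tokens, tgt_tokens, task_id):
        # ref_tokens, tgt_tokens: (batch, seq, dim) panels of the polyptych
        b = ref_tokens.size(0)
        ref = ref_tokens + self.panel_emb.weight[0]          # mark reference panel
        tgt = tgt_tokens + self.panel_emb.weight[1]          # mark target panel
        reg = self.registers[task_id].unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([reg, ref, tgt], dim=1)              # registers + polyptych
        out, _ = self.attn(seq, seq, seq)                    # full token-level interaction
        return out[:, reg.size(1) + ref.size(1):]            # return target-panel tokens

sketch = ICMASketch()
out = sketch(torch.randn(1, 16, 64), torch.randn(1, 16, 64), task_id=0)
print(out.shape)  # torch.Size([1, 16, 64])
```

The register tokens here act as a learned task switch; in the paper they additionally help the model separate position-aware (masked target) from position-free (reference-only) inputs.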
Related papers
- DreamO: A Unified Framework for Image Customization [23.11440970488944]
We present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. We employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data.
arXiv Detail & Related papers (2025-04-23T17:41:44Z)
- Generating Multi-Image Synthetic Data for Text-to-Image Customization [48.59231755159313]
Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision. We propose a simple approach that addresses both limitations.
arXiv Detail & Related papers (2025-02-03T18:59:41Z)
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models [62.70911549650579]
LoRACLR is a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.
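The summary's "contrastive objective to align and merge the weight spaces" can be illustrated with a toy: learn one merged residual that reproduces each concept-specific LoRA's response on that concept's inputs (positives) while staying away from the other concepts' responses (negatives). This is a hedged sketch of the idea, not LoRACLR's actual loss; all shapes and coefficients are made up.

```python
import torch

# Toy setup: per-concept LoRA residuals for one linear layer (dim x dim).
dim, n_concepts = 32, 3
deltas = [torch.randn(dim, dim) * 0.1 for _ in range(n_concepts)]   # frozen per-concept residuals
probes = [torch.randn(8, dim) for _ in range(n_concepts)]           # concept-specific probe inputs
merged = torch.zeros(dim, dim, requires_grad=True)                  # unified residual to learn

opt = torch.optim.Adam([merged], lr=1e-2)
for step in range(300):
    loss = torch.zeros(())
    for i in range(n_concepts):
        own = probes[i] @ deltas[i].T        # concept i's own LoRA response (positive)
        pred = probes[i] @ merged.T          # merged layer's response on concept i's inputs
        loss = loss + (pred - own).pow(2).mean()
        for j in range(n_concepts):
            if j != i:                       # other concepts' responses (negatives)
                other = probes[i] @ deltas[j].T
                loss = loss - 0.05 * (pred - other).pow(2).mean().clamp(max=1.0)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # positives matched, negatives pushed apart
```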
arXiv Detail & Related papers (2024-12-12T18:59:55Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [28.345828491336874]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirements of users. We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image. In addition, a special cross-attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to further enhance the interaction between specific spatial regions of the image and the corresponding parts of the text prompts.
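A hedged reading of the CTS module: separate cross-attention paths let each image region attend to its own slice of the text on top of a global pass over the full prompt. The toy PyTorch sketch below invents the region routing and all names; the paper's actual CTS design will differ.

```python
import torch
import torch.nn as nn

class CTSSketch(nn.Module):
    """Toy Center-Total-Surrounding cross-attention: image tokens attend to a
    global prompt plus region-specific prompt slices. The region routing and
    every name here are assumptions, not the paper's actual design."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.center = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.total = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.surround = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img, center_mask, txt_center, txt_total, txt_surround):
        # img: (b, n, dim); center_mask: (b, n) bool, True for the unmasked center
        out = self.total(img, txt_total, txt_total)[0]            # global text pass
        c = self.center(img, txt_center, txt_center)[0]           # unmasked-part text
        s = self.surround(img, txt_surround, txt_surround)[0]     # outpainted-part text
        m = center_mask.unsqueeze(-1).float()
        return out + m * c + (1 - m) * s   # route each region to its text slice

cts = CTSSketch()
y = cts(torch.randn(1, 16, 64), torch.zeros(1, 16, dtype=torch.bool),
        torch.randn(1, 8, 64), torch.randn(1, 8, 64), torch.randn(1, 8, 64))
print(y.shape)  # torch.Size([1, 16, 64])
```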
arXiv Detail & Related papers (2024-06-03T07:14:19Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- Orthogonal Adaptation for Modular Customization of Diffusion Models [39.62438974450659]
We address a new problem called Modular Customization, with the goal of efficiently merging customized models. We introduce Orthogonal Adaptation, a method designed to encourage customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights so that they can be merged at inference with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture.
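The orthogonality trick is easy to demonstrate numerically: if each concept's residual is built from a disjoint block of a shared orthonormal basis, the residuals can be summed and later act almost independently. Below is a toy sketch under that assumption (one plausible instantiation, not necessarily the paper's exact construction).

```python
import torch

# Shared orthonormal basis; each concept owns a disjoint block of columns, so
# different concepts' residuals live in mutually orthogonal subspaces.
dim, rank, n_concepts = 64, 4, 3
basis = torch.linalg.qr(torch.randn(dim, dim)).Q     # orthonormal columns

def concept_residual(i, trainable):
    # trainable: (rank, dim) learned factor; the orthogonal factor is frozen
    B = basis[:, i * rank:(i + 1) * rank]            # (dim, rank), disjoint per concept
    return B @ trainable                             # (dim, dim) residual for concept i

residuals = [concept_residual(i, torch.randn(rank, dim)) for i in range(n_concepts)]
merged = sum(residuals)                              # merge by simple summation

# Interference check: projecting the merged residual onto concept 0's subspace
# recovers concept 0's residual almost exactly.
B0 = basis[:, :rank]
recovered = B0 @ (B0.T @ merged)
print(torch.allclose(recovered, residuals[0], atol=1e-4))  # True
```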
arXiv Detail & Related papers (2023-12-05T02:17:48Z)
- A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers [0.0]
We propose a simple framework to achieve fair image segmentation of multiple modalities using a single conditional model.
We show that our framework outperforms other cross-modality segmentation methods on the Multi-Modality Whole Heart Segmentation Challenge.
arXiv Detail & Related papers (2023-10-09T09:51:44Z)
- MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid [1.1452732046200158]
Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification.
We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware, tempered-sigmoid-based Binary Cross-Entropy (BCE) loss function.
MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings.
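The loss named in the summary is concrete enough to sketch: a sigmoid with a temperature feeding binary cross-entropy, plus a hardness-aware weight that up-weights confidently wrong labels, in the spirit of focal loss. All parameter names and the exact weighting below are assumptions, not MuMIC's precise formulation.

```python
import torch

def tempered_sigmoid_bce(logits, targets, temperature=2.0, gamma=2.0):
    """Hypothetical hardness-aware tempered-sigmoid BCE for multi-label tasks.

    temperature flattens the sigmoid so extreme logits still receive gradient;
    gamma (focal-style) up-weights hard, misclassified labels.
    """
    probs = torch.sigmoid(logits / temperature)            # tempered sigmoid
    probs = probs.clamp(1e-7, 1 - 1e-7)
    bce = -(targets * probs.log() + (1 - targets) * (1 - probs).log())
    p_t = targets * probs + (1 - targets) * (1 - probs)    # prob of the true label
    hardness = (1 - p_t) ** gamma                          # focal-style weight
    return (hardness * bce).mean()

logits = torch.randn(4, 10)                    # batch of 4 images, 10 labels
targets = torch.randint(0, 2, (4, 10)).float()
print(tempered_sigmoid_bce(logits, targets))
```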
arXiv Detail & Related papers (2022-11-02T17:29:35Z)