Per-Query Visual Concept Learning
- URL: http://arxiv.org/abs/2508.09045v1
- Date: Tue, 12 Aug 2025 16:07:27 GMT
- Title: Per-Query Visual Concept Learning
- Authors: Ori Malca, Dvir Samuel, Gal Chechik
- Abstract summary: We show that many existing methods can be substantially augmented by adding a personalization step. Specifically, we leverage PDM features - previously designed to capture identity - and show how they can be used to improve semantic similarity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual concept learning, also known as text-to-image personalization, is the process of teaching new concepts to a pretrained model. This has numerous applications, from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that (1) is specific to the prompt and noise seed, and (2) uses two loss terms based on the self- and cross-attention, capturing the identity of the personalized concept. Specifically, we leverage PDM features - previously designed to capture identity - and show how they can be used to improve personalized semantic similarity. We evaluate the benefit that our method gains on top of six different personalization methods and several base text-to-image models (both UNet- and DiT-based). We find significant improvements even over previous per-query personalization methods.
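To make the abstract's two-loss idea concrete, the loop below is a minimal sketch of a per-query personalization step, not the authors' implementation: the `adapter` module, the `extract_feats` placeholder, the feature shapes, and the plain MSE losses are all illustrative assumptions; the paper's actual PDM feature extraction, loss weighting, and choice of tuned parameters are not specified here.

```python
# A minimal sketch of a per-query personalization step (hypothetical, not the
# authors' code). One "query" = a fixed prompt plus a fixed noise seed; we take
# a few gradient steps on a small adapter so the generated features match
# identity features extracted from the reference image of the concept.
import torch

torch.manual_seed(0)  # the noise seed is part of the query

n_tokens, d_feat = 77, 64

# Stand-ins for PDM-style identity features extracted from the reference image;
# in practice these come from the diffusion model's attention layers.
ref_self_feats = torch.randn(n_tokens, d_feat)
ref_cross_feats = torch.randn(n_tokens, d_feat)

# Only this small adapter is trained; the base T2I model stays frozen.
adapter = torch.nn.Linear(d_feat, d_feat)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

def extract_feats():
    """Placeholder for running the frozen model on the fixed prompt/seed and
    reading out self- and cross-attention features for the concept tokens."""
    base = torch.randn(n_tokens, d_feat)
    return adapter(base), adapter(base)

for step in range(50):
    gen_self, gen_cross = extract_feats()
    # Two loss terms, one per attention type, as described in the abstract.
    loss = (torch.nn.functional.mse_loss(gen_self, ref_self_feats)
            + torch.nn.functional.mse_loss(gen_cross, ref_cross_feats))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Since the prompt and seed are fixed, the adapter only has to fit this one query, which is what distinguishes per-query personalization from a general-purpose fine-tune.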
Related papers
- Multi-View Consistent Human Image Customization via In-Context Learning [62.83302682808891]
PersonalView enables an existing model to acquire multi-view generation capability with as few as 100 training samples. We evaluate the multi-view consistency, text alignment, identity similarity, and visual quality of PersonalView and compare it to recent baselines with potential multi-view customization capability.
arXiv Detail & Related papers (2025-10-31T22:21:28Z) - Improving Personalized Search with Regularized Low-Rank Parameter Updates [52.29168893900888]
We show how to adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion (a minimal sketch of such an update appears after this list). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries.
arXiv Detail & Related papers (2025-06-11T21:15:21Z) - MagicFace: Training-free Universal-Style Human Image Customized Synthesis [13.944050414488911]
MagicFace is a training-free method for multi-concept universal-style human image personalized synthesis.
Our core idea is to simulate how humans create images given specific concepts, first establishing a semantic layout.
In the first stage, RSA enables the latent image to query features from all reference concepts simultaneously.
arXiv Detail & Related papers (2024-08-14T10:08:46Z) - AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [3.5066393042242123]
We propose AttenCraft, an attention-based method for multiple-concept disentanglement. We introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts. Our model effectively mitigates two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.
arXiv Detail & Related papers (2024-05-28T08:50:14Z) - Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models [66.05234562835136]
We present MuDI, a novel framework that enables multi-subject personalization.
Our main idea is to utilize segmented subjects produced by a foundation segmentation model.
Experimental results show that our MuDI can produce high-quality personalized images without identity mixing.
arXiv Detail & Related papers (2024-04-05T17:45:22Z) - OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization [9.552325786494334]
We introduce a novel parameter-efficient one-shot fine-tuning method for attribute-focused text-to-image (T2I) personalization.
A novel hypernetwork-powered attribute-focused fine-tuning mechanism is employed to achieve the precise learning of various attribute features.
Our method shows significant superiority in attribute identification and application, and achieves a good balance between efficiency and output quality.
arXiv Detail & Related papers (2024-03-17T01:42:48Z) - Gen4Gen: Generative Data Pipeline for Generative Multi-Concept
Composition [47.07564907486087]
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts.
This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
arXiv Detail & Related papers (2024-02-23T18:55:09Z) - Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes. Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations, remains elusive. We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z) - CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image
Personalization [56.892032386104006]
CatVersion is an inversion-based method that learns the personalized concept through a handful of examples.
Users can utilize text prompts to generate images that embody the personalized concept.
arXiv Detail & Related papers (2023-11-24T17:55:10Z) - Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z) - Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)