SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation
- URL: http://arxiv.org/abs/2510.07340v1
- Date: Tue, 07 Oct 2025 18:01:55 GMT
- Title: SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation
- Authors: Yongzhi Li, Saining Zhang, Yibing Chen, Boying Li, Yanxin Zhang, Xiaoyu Du,
- Abstract summary: SpotDiff is a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
- Score: 6.116573441311417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
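The abstract mentions isolating subject identity through orthogonality constraints in the feature space but gives no formula. As a rough illustration only, and not the authors' actual loss, one common way to express such a constraint is to penalize the squared cosine similarity between a subject feature and a nuisance (pose or background) feature. The function names `orthogonality_penalty` and `remove_component`, and the use of plain NumPy arrays in place of CLIP or expert-network outputs, are assumptions made for this sketch.

```python
import numpy as np


def orthogonality_penalty(subject, nuisance, eps=1e-8):
    """Mean squared cosine similarity between paired rows.

    Hypothetical helper: `subject` and `nuisance` are (batch, dim) arrays
    standing in for identity features and pose/background features.
    The value is near 0 when paired rows are orthogonal and near 1 when
    they are parallel.
    """
    s = subject / (np.linalg.norm(subject, axis=1, keepdims=True) + eps)
    n = nuisance / (np.linalg.norm(nuisance, axis=1, keepdims=True) + eps)
    cos = np.sum(s * n, axis=1)
    return float(np.mean(cos ** 2))


def remove_component(feature, nuisance, eps=1e-8):
    """Subtract from each feature row its projection onto the paired nuisance row.

    Hypothetical helper showing the geometric idea of disentangling:
    the result has (almost) no component along the nuisance direction.
    """
    coeff = np.sum(feature * nuisance, axis=1, keepdims=True) / (
        np.sum(nuisance * nuisance, axis=1, keepdims=True) + eps
    )
    return feature - coeff * nuisance
```

In a training setup, a penalty of this kind would typically be added to the main generation loss with a weighting factor. How SpotDiff actually combines its terms is not stated in the abstract.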
Related papers
- FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus [10.615833390806486]
Multi-subject personalized image generation aims to synthesize customized images containing multiple specified subjects without requiring test-time optimization. We present FocusDPO, a framework that adaptively identifies focus regions based on dynamic semantic correspondence and supervision image complexity.
arXiv Detail & Related papers (2025-09-01T07:06:36Z)
- HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection [6.060036926093259]
HAMLET-FFD is a cross-domain generalization framework for face forgery detection. It integrates visual evidence with conceptual cues, emulating expert forensic analysis. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin.
arXiv Detail & Related papers (2025-07-28T15:09:52Z)
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation [25.925198876189057]
FreeGraftor is a training-free framework that addresses limitations through cross-image feature grafting. Our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment.
arXiv Detail & Related papers (2025-04-22T14:55:23Z)
- Pose-Transformation and Radial Distance Clustering for Unsupervised Person Re-identification [5.522856885199346]
Person re-identification (re-ID) aims to tackle the problem of matching identities across non-overlapping cameras.
Supervised approaches require identity information that may be difficult to obtain and are inherently biased towards the dataset they are trained on.
We propose an unsupervised approach to the person re-ID setup. Having zero knowledge of true labels, our proposed method enhances the discriminating ability of the learned features.
arXiv Detail & Related papers (2024-11-06T20:55:30Z)
- TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
arXiv Detail & Related papers (2024-07-29T18:00:10Z)
- Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition [36.55724380184354]
We propose CLTD, a discriminative feature learning module designed to eliminate the influence of confounders in triple domains, i.e., spatial, temporal, and spectral.
Specifically, we utilize the Cross Pixel-wise Attention Generator (CPAG) to generate attention distributions for factual and counterfactual features in spatial and temporal domains.
Then, we introduce the Fourier Projection Head (FPH) to project spatial features into the spectral space, which preserves essential information while reducing computational costs.
arXiv Detail & Related papers (2024-07-17T12:16:44Z)
- Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
Generalizable vehicle re-identification (ReID) seeks to develop models that can adapt to unknown target domains without the need for fine-tuning or retraining. Previous works have mainly focused on extracting domain-invariant features by aligning data distributions between source domains. We propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method to solve this unique problem.
arXiv Detail & Related papers (2024-07-10T04:06:39Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- CoDo: Contrastive Learning with Downstream Background Invariance for Detection [10.608660802917214]
We propose a novel object-level self-supervised learning method, called Contrastive learning with Downstream background invariance (CoDo).
The pretext task is converted to focus on instance location modeling for various backgrounds, especially for downstream datasets.
Experiments on MSCOCO demonstrate that the proposed CoDo with a common backbone, ResNet50-FPN, yields strong transfer learning results for object detection.
arXiv Detail & Related papers (2022-05-10T01:26:15Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)