SPF-Portrait: Towards Pure Text-to-Portrait Customization with Semantic Pollution-Free Fine-Tuning
- URL: http://arxiv.org/abs/2504.00396v3
- Date: Tue, 27 May 2025 13:43:25 GMT
- Title: SPF-Portrait: Towards Pure Text-to-Portrait Customization with Semantic Pollution-Free Fine-Tuning
- Authors: Xiaole Xian, Zhichao Liao, Qingyu Li, Wenyu Qin, Pengfei Wan, Weicheng Xie, Long Zeng, Linlin Shen, Pingfa Feng
- Abstract summary: SPF-Portrait is a pioneering work that purely captures customized target semantics while minimizing disruption to the original model. It designs a dual-path contrastive learning pipeline that introduces the original model as a behavioral alignment reference, and it adaptively balances behavioral alignment across different regions against the responsiveness of the target semantics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning a pre-trained Text-to-Image (T2I) model on a tailored portrait dataset is the mainstream method for text-to-portrait customization. However, existing methods often severely impact the original model's behavior (e.g., changes in ID, layout, etc.) while customizing portrait attributes. To address this issue, we propose SPF-Portrait, a pioneering work that purely learns the customized target semantics while minimizing disruption to the original model. In SPF-Portrait, we design a dual-path contrastive learning pipeline, which introduces the original model as a behavioral alignment reference for the conventional fine-tuning path. During contrastive learning, we propose a novel Semantic-Aware Fine Control Map, which indicates the intensity of the response regions of the target semantics, to spatially guide the alignment process between the contrastive paths. It adaptively balances behavioral alignment across different regions against the responsiveness of the target semantics. Furthermore, we propose a novel response enhancement mechanism to reinforce the presentation of the target semantics, while mitigating the representation discrepancy inherent in direct cross-modal supervision. Through the above strategies, we achieve incremental learning of customized target semantics for pure text-to-portrait customization. Extensive experiments show that SPF-Portrait achieves state-of-the-art performance. Project page: https://spf-portrait.github.io/SPF-Portrait/
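As a concrete reading of this abstract, the sketch below shows a dual-path objective in PyTorch: a frozen copy of the pre-trained model serves as the behavioral alignment reference, and a spatial control map (a stand-in for the Semantic-Aware Fine Control Map, supplied here as an input rather than derived as in the paper) decides where alignment is relaxed so the target semantics can respond. All function and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dual_path_loss(finetuned_unet, frozen_unet, noisy_latents, t, text_emb,
                   noise, control_map):
    """Sketch of a dual-path objective with a spatial control map.
    control_map: (N, 1, H, W), ~1 where the target attribute should respond,
    ~0 where the original model's behavior should be preserved (assumed shape)."""
    eps_ft = finetuned_unet(noisy_latents, t, text_emb)    # fine-tuning path
    with torch.no_grad():
        eps_ref = frozen_unet(noisy_latents, t, text_emb)  # frozen reference path
    # Standard denoising loss keeps the customized target semantics responsive.
    denoise = F.mse_loss(eps_ft, noise)
    # Alignment term pulls the fine-tuned path back toward the original model,
    # down-weighted wherever the map says the target semantics should respond.
    align = (((eps_ft - eps_ref) ** 2) * (1.0 - control_map)).mean()
    return denoise + align
```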
Related papers
- Decouple before Align: Visual Disentanglement Enhances Prompt Tuning [85.91474962071452]
Prompt tuning (PT) has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality typically conveys richer context than the text. We propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept.
arXiv Detail & Related papers (2025-08-01T07:46:00Z)
- IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis [46.502962768034166]
Zero-shot Referring Image Segmentation identifies the instance mask that best aligns with a referring expression, without training or fine-tuning. Previous CLIP-based models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. IteRPrimE outperforms previous state-of-the-art zero-shot methods, particularly excelling in out-of-domain scenarios.
arXiv Detail & Related papers (2025-03-02T15:19:37Z)
- Diffusion-Based Conditional Image Editing through Optimized Inference with Guidance [46.922018440110826]
We present a training-free approach for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our method achieves outstanding image-to-image translation performance on various tasks when combined with the pretrained Stable Diffusion model.
arXiv Detail & Related papers (2024-12-20T11:15:31Z)
- TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
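The summary does not spell out the optimization, but energy-guided latent optimization generically takes gradient steps on the diffusion latents to lower a composition energy. A minimal sketch, assuming a differentiable, user-supplied energy_fn (hypothetical; TALE's actual energy terms are defined in the paper):

```python
import torch

def energy_guided_step(latents, energy_fn, step_size=0.1):
    """One gradient step on the latents to lower a scalar composition energy."""
    latents = latents.detach().requires_grad_(True)
    energy = energy_fn(latents)            # scalar: lower = better-composed
    grad, = torch.autograd.grad(energy, latents)
    return (latents - step_size * grad).detach()
```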
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
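A minimal sketch of the interpolation step described above, assuming attention features from the source and target branches are available at each denoising step; the schedule and names are illustrative rather than DiffUHaul's exact design:

```python
import torch

def blend_attention_features(feat_src, feat_tgt, t, t_switch=0.6):
    """Interpolate source/target attention features during the early steps only.
    t runs from 1.0 (start of denoising) down to 0.0."""
    if t > t_switch:
        alpha = (t - t_switch) / (1.0 - t_switch)   # 1 -> 0 over the early phase
        return alpha * feat_src + (1.0 - alpha) * feat_tgt
    return feat_tgt  # late steps: commit fully to the new layout
```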
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between the fine-tuned and pretrained models. We show that our approach achieves better prompt fidelity and subject fidelity than models post-optimized by merging regular fine-tuned models.
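One plausible form of such a deviation-controlled objective compares the denoising errors of the fine-tuned and frozen pretrained models and saturates as the gap grows; the sketch below is an interpretation of the summary, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def deviation_controlled_loss(eps_ft, eps_ref, noise, beta=1.0):
    """eps_ft/eps_ref: noise predictions of the fine-tuned and frozen models."""
    l_ft = ((eps_ft - noise) ** 2).flatten(1).mean(dim=1)    # per-sample error
    l_ref = ((eps_ref - noise) ** 2).flatten(1).mean(dim=1)
    # Reward improving on the reference, but saturate via the sigmoid so the
    # fine-tuned model is not pushed arbitrarily far from pretrained behavior.
    return -F.logsigmoid(-beta * (l_ft - l_ref)).mean()
```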
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization [31.960807999301196]
We propose a novel plug-in method, called DreamMatcher, which reformulates T2I personalization as semantic matching.
Specifically, DreamMatcher replaces the target values with reference values aligned by semantic matching, while leaving the structure path unchanged.
We also introduce a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions introduced by the target prompts.
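The value-replacement idea can be sketched as follows: keep the target's query/key (structure) path, but attend over reference values re-ordered by a nearest-neighbor semantic matching. The flattened shapes and the matching rule here are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def matched_value_attention(q_tgt, k_tgt, v_ref, feat_tgt, feat_ref):
    """q_tgt, k_tgt: (N, d) target queries/keys; v_ref: (M, dv) reference values;
    feat_tgt: (N, c), feat_ref: (M, c) features used for semantic matching."""
    # For each target token, find its closest reference token by cosine similarity.
    sim = F.normalize(feat_tgt, dim=-1) @ F.normalize(feat_ref, dim=-1).T
    idx = sim.argmax(dim=-1)                     # (N,)
    v_aligned = v_ref[idx]                       # reference appearance, target order
    # Structure path unchanged: attention weights come from target Q and K only.
    attn = (q_tgt @ k_tgt.T / q_tgt.shape[-1] ** 0.5).softmax(dim=-1)
    return attn @ v_aligned
```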
arXiv Detail & Related papers (2024-02-15T09:21:16Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- SPColor: Semantic Prior Guided Exemplar-based Image Colorization [14.191819767895867]
We propose SPColor, a semantic prior guided exemplar-based image colorization framework.
SPColor first coarsely classifies pixels of the reference and target images into several pseudo-classes under the guidance of a semantic prior.
Our model outperforms recent state-of-the-art methods both quantitatively and qualitatively on a public dataset.
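A minimal sketch of correspondence restricted by pseudo-classes, assuming per-pixel features and pseudo-class labels are already computed and flattened to (N, D) and (N,); the real pipeline's classifier and matching are more involved:

```python
import torch

def class_restricted_match(feat_tgt, feat_ref, cls_tgt, cls_ref):
    """Match each target pixel only to reference pixels of the same pseudo-class.
    Assumes every target pseudo-class also appears in the reference."""
    sim = feat_tgt @ feat_ref.T                      # (N_tgt, N_ref) similarity
    same = cls_tgt[:, None] == cls_ref[None, :]      # pseudo-class agreement mask
    sim = sim.masked_fill(~same, float('-inf'))      # forbid cross-class matches
    return sim.argmax(dim=-1)                        # index of matched ref pixel
```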
arXiv Detail & Related papers (2023-04-13T04:21:45Z)
- Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion [35.95513392917737]
A novel approach called Dual-Cycle Diffusion generates an unbiased mask to guide image editing.
Our experiments demonstrate the effectiveness of the proposed method, as it significantly improves the D-CLIP score from 0.272 to 0.283.
arXiv Detail & Related papers (2023-02-05T14:30:22Z)
- Diffusion-based Image Translation using Disentangled Style and Content Representation [51.188396199083336]
Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer.
It is often difficult to maintain the original content of the image during the reverse diffusion.
We present a novel diffusion-based unsupervised image translation method using disentangled style and content representation.
Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
arXiv Detail & Related papers (2022-09-30T06:44:37Z)
- Learning Diverse Tone Styles for Image Retouching [73.60013618215328]
We propose to learn diverse image retouching with normalizing flow-based architectures.
A joint-training pipeline is composed of a style encoder, a conditional RetouchNet, and the image tone style normalizing flow (TSFlow) module.
Our proposed method performs favorably against state-of-the-art methods and is effective in generating diverse results.
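Diversity then comes from sampling the flow's prior at test time. A sketch under assumed interfaces (tsflow.inverse and retouch_net are hypothetical stand-ins for the TSFlow module and conditional RetouchNet):

```python
import torch

def sample_retouch(image, tsflow, retouch_net, style_dim=128):
    z = torch.randn(1, style_dim)     # draw a new style from the flow's Gaussian prior
    style = tsflow.inverse(z)         # latent -> tone-style vector (assumed API)
    return retouch_net(image, style)  # condition the retoucher on the sampled style
```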
arXiv Detail & Related papers (2022-07-12T09:49:21Z)
- Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
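The contrastive component can be illustrated with a standard InfoNCE loss over matched cross-domain feature pairs; MCL-Net's marginal formulation differs in detail, so treat this as a baseline sketch:

```python
import torch
import torch.nn.functional as F

def contrastive_correspondence_loss(feat_a, feat_b, tau=0.07):
    """InfoNCE over matched cross-domain pairs (row i of feat_a <-> row i of feat_b)."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.T / tau                               # (N, N) similarities
    labels = torch.arange(a.size(0), device=a.device)
    # Each feature must identify its true counterpart in the other domain.
    return F.cross_entropy(logits, labels)
```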
arXiv Detail & Related papers (2022-04-01T13:55:44Z)
- Learning Pixel-Adaptive Weights for Portrait Photo Retouching [1.9843222704723809]
Portrait photo retouching emphasizes human-region priority and group-level consistency.
In this paper, we model local context cues to improve the retouching quality explicitly.
Experiments on PPR10K dataset verify the effectiveness of our method.
arXiv Detail & Related papers (2021-12-07T07:23:42Z)
- Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Because existing methods pay little attention to content changes, semantic information from source images degrades during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z)
- DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation [97.74059510314554]
Unsupervised domain adaptation (UDA) for semantic segmentation aims to adapt a segmentation model trained on the labeled source domain to the unlabeled target domain.
Existing methods try to learn domain invariant features while suffering from large domain gaps.
We propose a novel Dual Soft-Paste (DSP) method in this paper.
arXiv Detail & Related papers (2021-07-20T16:22:40Z)
- Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization [72.65828901909708]
Controllable person image generation aims to produce realistic human images with desirable attributes.
We introduce a novel Spatially-Adaptive Warped Normalization (SAWN), which integrates a learned flow-field to warp modulation parameters.
We propose a novel self-training part replacement strategy to refine the pretrained model for the texture-transfer task.
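A minimal sketch of the warped-normalization idea: modulation parameters are warped by a learned flow field before being applied to normalized features. The grid_sample-based warp and instance normalization here are assumptions about the general mechanism, not the paper's exact operator:

```python
import torch
import torch.nn.functional as F

def warped_modulation(x, gamma, beta, flow_grid):
    """gamma, beta: (N, C, H, W) modulation maps; flow_grid: (N, H, W, 2)
    sampling grid in [-1, 1] produced by a learned flow field."""
    gamma_w = F.grid_sample(gamma, flow_grid, align_corners=False)
    beta_w = F.grid_sample(beta, flow_grid, align_corners=False)
    # Spatially-adaptive (de)normalization with the warped parameters.
    return gamma_w * F.instance_norm(x) + beta_w
```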
arXiv Detail & Related papers (2021-05-31T07:07:44Z)
- Learning Semantic Person Image Generation by Region-Adaptive Normalization [81.52223606284443]
We propose a new two-stage framework to handle the pose and appearance translation.
In the first stage, we predict the target semantic parsing maps to eliminate the difficulties of pose transfer.
In the second stage, we suggest a new person image generation method that incorporates region-adaptive normalization.
arXiv Detail & Related papers (2021-04-14T06:51:37Z)
- Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for Unsupervised Domain Adaptation in Segmentation [15.428323201750144]
BiSIDA employs consistency regularization to efficiently exploit information from the unlabeled target dataset.
BiSIDA achieves new state-of-the-art on two commonly-used synthetic-to-real domain adaptation benchmarks.
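Consistency regularization of this kind is commonly implemented by matching predictions across weakly and strongly augmented views of unlabeled target images; a generic sketch (BiSIDA's source-guided perturbation is more specific):

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong):
    """Predictions on a strongly-augmented view should match the (detached)
    predictions on a weakly-augmented view of the same unlabeled image."""
    with torch.no_grad():
        target = model(x_weak).softmax(dim=1)        # pseudo-label distribution
    log_pred = model(x_strong).log_softmax(dim=1)
    return F.kl_div(log_pred, target, reduction='batchmean')
```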
arXiv Detail & Related papers (2020-09-18T03:26:44Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
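A cycle-consistency loss of this kind can be written directly over predicted transformations: composing the forward and backward transforms should be close to the identity. The 3x3 homogeneous-matrix form below is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(T_ab, T_ba):
    """T_ab, T_ba: (N, 3, 3) predicted homogeneous transformations A->B and B->A.
    Their composition should be (near-)identity for a geometrically plausible match."""
    eye = torch.eye(3, device=T_ab.device).expand_as(T_ab)
    return F.mse_loss(torch.bmm(T_ba, T_ab), eye)
```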
arXiv Detail & Related papers (2020-03-31T22:38:09Z)