U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- URL: http://arxiv.org/abs/2403.20231v1
- Date: Fri, 29 Mar 2024 15:20:34 GMT
- Title: U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- Authors: You Wu, Kean Liu, Xiaoyue Mi, Fan Tang, Juan Cao, Jintao Li,
- Abstract summary: State-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space.
A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes.
Experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts.
- Score: 18.841473623776153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted in semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization.
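The abstract states only that, at inference, adjustments are made in semantic space using the learned target and non-target embeddings; it does not give the update rule. A minimal sketch of one plausible form, assuming PyTorch tensors from a text encoder and a hypothetical scaling factor `strength`:

```python
import torch

def adjust_prompt_embedding(prompt_emb: torch.Tensor,
                            target_emb: torch.Tensor,
                            non_target_emb: torch.Tensor,
                            strength: float = 0.5) -> torch.Tensor:
    """Shift a prompt embedding toward the learned target-attribute embedding
    and away from the learned non-target embedding (hypothetical form; the
    paper only states that adjustments happen in semantic space)."""
    direction = target_emb - non_target_emb
    direction = direction / direction.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return prompt_emb + strength * direction

# Toy usage with random 768-d vectors standing in for text-encoder embeddings.
prompt_emb = torch.randn(1, 768)
target_emb = torch.randn(1, 768)       # learned from target-related augmented samples
non_target_emb = torch.randn(1, 768)   # learned from non-target augmented samples
print(adjust_prompt_embedding(prompt_emb, target_emb, non_target_emb).shape)
```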
Related papers
- DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation [22.599542105037443]
DisEnvisioner is a novel approach for effectively extracting and enriching the subject-essential features while filtering out subject-irrelevant information.
Specifically, the features of the subject and of other irrelevant components are effectively separated into distinct visual tokens, enabling much more accurate customization.
Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and the overall image quality.
arXiv Detail & Related papers (2024-10-02T22:29:14Z)
- Data Augmentation via Latent Diffusion for Saliency Prediction [67.88936624546076]
Saliency prediction models are constrained by the limited diversity and quantity of labeled data.
We propose a novel data augmentation method for deep saliency prediction that edits natural images while preserving the complexity and variability of real-world scenes.
arXiv Detail & Related papers (2024-09-11T14:36:24Z)
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
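A minimal sketch of this generative-retrieval idea, with a hypothetical `sentence_log_prob` callable standing in for an image-conditioned vision-language model and an assumed prompt template:

```python
import random
from typing import Callable, Dict, List, Tuple

def recognize_attributes(image: object,
                         candidate_attributes: List[str],
                         sentence_log_prob: Callable[[object, str], float],
                         top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate attributes by the log-probability an image-conditioned
    model assigns to a short sentence naming each attribute."""
    scores: Dict[str, float] = {}
    for attr in candidate_attributes:
        sentence = f"a photo of something {attr}"   # assumed prompt template
        scores[attr] = sentence_log_prob(image, sentence)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy scorer standing in for a real vision-language model.
def toy_scorer(image: object, sentence: str) -> float:
    random.seed(hash(sentence) % (2 ** 32))
    return -abs(random.gauss(0.0, 1.0))

print(recognize_attributes(object(), ["striped", "metallic", "furry"], toy_scorer))
```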
arXiv Detail & Related papers (2024-08-07T21:44:29Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions from unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
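A simplified sketch of the self-constructed preference data mentioned above, assuming a `describe` callable that stands in for the vision-language model and hypothetical good/corrupted prompts; the paper's actual sampling and corruption strategy is not specified here:

```python
from typing import Callable, List, Tuple

def build_preference_pairs(unlabeled_images: List[object],
                           describe: Callable[[object, str], str]
                           ) -> List[Tuple[object, str, str]]:
    """Self-construct (image, preferred, dispreferred) description pairs from
    unlabeled images; `describe` stands in for the model being improved."""
    good_prompt = "Describe the image in detail."             # assumed prompt
    bad_prompt = "Briefly guess what the image might show."   # assumed corrupted prompt
    pairs = []
    for img in unlabeled_images:
        preferred = describe(img, good_prompt)     # careful, grounded description
        dispreferred = describe(img, bad_prompt)   # lower-quality description
        pairs.append((img, preferred, dispreferred))
    return pairs

# Toy usage with a dummy describer.
pairs = build_preference_pairs(["img0", "img1"], lambda img, p: f"{p} ({img})")
print(len(pairs), pairs[0][1])
```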
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
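A generic text-guided contrastive loss along the lines described above, contrasting pooled visual features against frozen text-encoder class embeddings; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def text_guided_contrastive_loss(pixel_feats: torch.Tensor,
                                 text_feats: torch.Tensor,
                                 labels: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Contrast pooled visual features against class text-encoder embeddings.

    pixel_feats: (N, D) features, one per sampled region
    text_feats:  (C, D) class-name embeddings from a frozen text encoder
    labels:      (N,)   class index of each region
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = pixel_feats @ text_feats.t() / temperature   # (N, C)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 regions, 4 classes, 256-d features.
loss = text_guided_contrastive_loss(torch.randn(8, 256),
                                    torch.randn(4, 256),
                                    torch.randint(0, 4, (8,)))
print(loss.item())
```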
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization [9.552325786494334]
We introduce a novel parameter-efficient one-shot fine-tuning method for attribute-focused text-to-image (T2I) personalization.
A novel hypernetwork-powered attribute-focused fine-tuning mechanism is employed to achieve the precise learning of various attribute features.
Our method shows significant superiority in attribute identification and application, and achieves a good balance between efficiency and output quality.
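The summary names a hypernetwork-powered, attribute-focused fine-tuning mechanism without detailing it. The sketch below shows one generic way a small hypernetwork could predict a low-rank update for a frozen layer from an attribute embedding; the class name, dimensions, and architecture are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class AttributeHyperLinear(nn.Module):
    """A frozen linear layer whose output is modulated by a low-rank weight
    update predicted from an attribute embedding by a small hypernetwork."""
    def __init__(self, base: nn.Linear, attr_dim: int, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # only the hypernetwork is tuned
        out_f, in_f = base.weight.shape
        self.rank, self.in_f, self.out_f = rank, in_f, out_f
        self.hyper = nn.Linear(attr_dim, rank * (in_f + out_f))

    def forward(self, x: torch.Tensor, attr_emb: torch.Tensor) -> torch.Tensor:
        ab = self.hyper(attr_emb)                            # (rank*(in_f+out_f),)
        a = ab[: self.rank * self.in_f].view(self.rank, self.in_f)
        b = ab[self.rank * self.in_f:].view(self.out_f, self.rank)
        delta_w = b @ a                                      # (out_f, in_f) low-rank update
        return self.base(x) + x @ delta_w.t()

layer = AttributeHyperLinear(nn.Linear(64, 64), attr_dim=32)
print(layer(torch.randn(2, 64), torch.randn(32)).shape)  # torch.Size([2, 64])
```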
arXiv Detail & Related papers (2024-03-17T01:42:48Z)
- Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques primarily focus on global text and image representations, yet overlook multi-modal attribute characteristics.
We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion model and requires only a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
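A simplified stand-in for the meta-prompt idea: learnable embeddings that query frozen diffusion features via cross-attention to produce task-oriented features for a perception head. The paper introduces the prompts into the pre-trained diffusion model itself; the module below, its dimensions, and the attention setup are assumptions:

```python
import torch
import torch.nn as nn

class MetaPromptExtractor(nn.Module):
    """Learnable 'meta prompt' tokens that attend over frozen diffusion
    features to extract features suited to a perception task."""
    def __init__(self, feat_dim: int = 320, num_prompts: int = 16):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, diffusion_feats: torch.Tensor) -> torch.Tensor:
        # diffusion_feats: (B, HW, D) spatial features from a frozen UNet block
        b = diffusion_feats.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)      # (B, P, D) queries
        out, _ = self.attn(q, diffusion_feats, diffusion_feats)
        return out                                           # (B, P, D) features for a task head

feats = torch.randn(2, 64 * 64, 320)   # toy stand-in for UNet features
print(MetaPromptExtractor()(feats).shape)  # torch.Size([2, 16, 320])
```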
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Parts of Speech-Grounded Subspaces in Vision-Language Models [32.497303059356334]
We propose to separate representations of different visual modalities in CLIP's joint vision-language space.
We learn subspaces capturing variability corresponding to a specific part of speech, while minimising variability to the rest.
We show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances.
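A minimal sketch of learning a subspace that captures the variability a given part of speech induces in text embeddings, using an SVD over paired prompt embeddings; the paper's objective also suppresses variability from other factors, which this toy version omits:

```python
import torch

def learn_pos_subspace(with_pos: torch.Tensor,
                       without_pos: torch.Tensor,
                       k: int = 8) -> torch.Tensor:
    """Fit a k-dim subspace spanning the variation that a part of speech
    (e.g., adjectives) induces in paired text embeddings."""
    diffs = with_pos - without_pos                   # (N, D) paired prompt embeddings
    diffs = diffs - diffs.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:k]                                    # (k, D) orthonormal basis

def project(emb: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project an embedding onto the learned part-of-speech subspace."""
    return (emb @ basis.t()) @ basis

# Toy usage with random stand-ins for CLIP text embeddings (D=512).
basis = learn_pos_subspace(torch.randn(100, 512), torch.randn(100, 512))
print(project(torch.randn(512), basis).shape)  # torch.Size([512])
```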
arXiv Detail & Related papers (2023-05-23T13:32:19Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
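A sketch combining the three loss terms named above with generic formulations (an in-batch view-matching contrastive loss, an image-text contrastive loss, and KL distillation from frozen CLIP logits); the weights and exact loss forms are assumptions, not the paper's definitions:

```python
import torch
import torch.nn.functional as F

def sgva_total_loss(v_adapted, v_aug, text_feats, labels, clip_logits,
                    adapted_logits, temperature=0.07, w=(1.0, 1.0, 1.0)):
    """Weighted sum of a vision-specific contrastive loss, a cross-modal
    contrastive loss, and a distillation term toward frozen CLIP logits."""
    v1 = F.normalize(v_adapted, dim=-1)
    v2 = F.normalize(v_aug, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Vision-specific contrastive: each sample should match its own augmented view.
    vv = v1 @ v2.t() / temperature
    l_vis = F.cross_entropy(vv, torch.arange(v1.size(0)))

    # Cross-modal contrastive: visual features vs. class text embeddings.
    vt = v1 @ t.t() / temperature
    l_xmod = F.cross_entropy(vt, labels)

    # Implicit knowledge distillation from frozen CLIP predictions.
    l_kd = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                    F.softmax(clip_logits, dim=-1), reduction="batchmean")

    return w[0] * l_vis + w[1] * l_xmod + w[2] * l_kd

# Toy usage: batch of 8, 4 classes, 512-d features.
loss = sgva_total_loss(torch.randn(8, 512), torch.randn(8, 512),
                       torch.randn(4, 512), torch.randint(0, 4, (8,)),
                       torch.randn(8, 4), torch.randn(8, 4))
print(loss.item())
```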
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.