Visual Persona: Foundation Model for Full-Body Human Customization
- URL: http://arxiv.org/abs/2503.15406v2
- Date: Mon, 24 Mar 2025 07:28:09 GMT
- Title: Visual Persona: Foundation Model for Full-Body Human Customization
- Authors: Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, Yang Zhou
- Abstract summary: We introduce Visual Persona, a model for text-to-image full-body human customization. Our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs.
- Score: 36.135949939650786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
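A minimal PyTorch sketch of the region-wise conditioning described in the abstract may help clarify the idea: body-region crops from the reference image are encoded as local appearance features and projected independently into dense identity tokens that would condition a frozen text-to-image diffusion model. The class name, the number of body regions, and all tensor dimensions below are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the region-wise identity conditioning described in the
# abstract. Class names, the number of body regions, and all dimensions are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class RegionIdentityEncoder(nn.Module):
    """Encodes body-region crops into dense identity embeddings."""

    def __init__(self, num_regions: int = 4, feat_dim: int = 768, num_tokens: int = 16):
        super().__init__()
        # A shared patch embedding stands in for the local appearance encoder.
        self.patchify = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Learnable queries turn each region into a fixed number of identity
        # tokens, independently per region.
        self.queries = nn.Parameter(torch.randn(num_regions, num_tokens, feat_dim))
        dec_layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

    def forward(self, region_crops: torch.Tensor) -> torch.Tensor:
        # region_crops: (batch, num_regions, 3, H, W), e.g. crops of the face,
        # upper body, lower body, and shoes from the reference image.
        b, r, _, _, _ = region_crops.shape
        feats = self.patchify(region_crops.flatten(0, 1))   # (b*r, D, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)             # (b*r, P, D) patch tokens
        feats = self.encoder(feats)                          # local appearance features
        queries = self.queries.unsqueeze(0).expand(b, -1, -1, -1).flatten(0, 1)
        ident = self.decoder(queries, feats)                 # (b*r, T, D) identity tokens
        # Per-region tokens are concatenated; in the full model they would
        # condition the frozen diffusion backbone, e.g. via cross-attention.
        return ident.reshape(b, -1, ident.shape[-1])


if __name__ == "__main__":
    crops = torch.randn(2, 4, 3, 224, 224)   # two reference images, four regions each
    print(RegionIdentityEncoder()(crops).shape)  # torch.Size([2, 64, 768])
```

In the full system, the returned identity tokens would be injected into the pre-trained diffusion model alongside the text embeddings, keeping region identities separate so that appearance transfer stays localized.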
Related papers
- Controllable Human Image Generation with Personalized Multi-Garments [46.042383679103125]
BootComp is a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments.
We propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs.
We show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain.
arXiv Detail & Related papers (2024-11-25T12:37:13Z)
- Imagine yourself: Tuning-Free Personalized Image Generation [39.63411174712078]
We introduce Imagine yourself, a state-of-the-art model designed for personalized image generation.
It operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments.
Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment.
arXiv Detail & Related papers (2024-09-20T09:21:49Z)
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- UniHuman: A Unified Model for Editing Human Images in the Wild [49.896715833075106]
We propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings.
To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders.
In user studies, UniHuman is preferred by users in an average of 77% of cases.
arXiv Detail & Related papers (2023-12-22T05:00:30Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- MyStyle: A Personalized Generative Prior [38.3436972491162]
We introduce MyStyle, a personalized deep generative prior trained with a few shots of an individual.
MyStyle allows one to reconstruct, enhance, and edit images of a specific person.
arXiv Detail & Related papers (2022-03-31T17:59:19Z)
- Personalized visual encoding model construction with small data [1.6799377888527687]
We propose and test an alternative personalized ensemble encoding model approach to utilize existing encoding models.
We show that these personalized ensemble encoding models can be trained with small amounts of data from a specific individual.
Importantly, the personalized ensemble encoding models preserve patterns of inter-individual variability in the image-response relationship.
arXiv Detail & Related papers (2022-02-04T17:24:50Z)
- Generating Person Images with Appearance-aware Pose Stylizer [66.44220388377596]
We present a novel end-to-end framework to generate realistic person images based on given person poses and appearances.
The core of our framework is a novel generator called Appearance-aware Pose Stylizer (APS) which generates human images by coupling the target pose with the conditioned person appearance progressively.
arXiv Detail & Related papers (2020-07-17T15:58:05Z)
- Pose Manipulation with Identity Preservation [0.0]
We introduce Character Adaptive Identity Normalization GAN (CainGAN) which uses spatial characteristic features extracted by an embedder and combined across source images.
CainGAN receives figures of faces from a certain individual and produces new ones while preserving the person's identity.
Experimental results show that the quality of generated images scales with the size of the input set used during inference.
arXiv Detail & Related papers (2020-04-20T09:51:31Z)