Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods
- URL: http://arxiv.org/abs/2312.06116v1
- Date: Mon, 11 Dec 2023 04:47:39 GMT
- Authors: Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos,
Dimitris Skourtis
- Abstract summary: We focus on text-to-image systems that input a single image of an individual to ground the generation process, along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new SoTA both quantitatively and in human trials.
- Score: 52.806258774051216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we systematically study the problem of personalized
text-to-image generation, where the output image is expected to portray
information about specific human subjects. E.g., generating images of oneself
appearing at imaginative places, interacting with various items, or engaging in
fictional activities. To this end, we focus on text-to-image systems that input
a single image of an individual to ground the generation process along with
text describing the desired visual context. Our first contribution is to fill
the literature gap by curating high-quality, appropriate data for this task.
Namely, we introduce a standardized dataset (Stellar) that contains
personalized prompts coupled with images of individuals that is an order of
magnitude larger than existing relevant datasets and where rich semantic
ground-truth annotations are readily available. Having established Stellar, and to
further promote fine-grained cross-system comparisons, we introduce a rigorous
ensemble of specialized metrics that highlight and disentangle fundamental
properties such systems should obey. Besides being intuitive, our new metrics
correlate significantly more strongly with human judgment than currently used
metrics on this task. Last but not least, drawing inspiration from the recent
works of ELITE and SDXL, we derive a simple yet efficient, personalized
text-to-image baseline that does not require test-time fine-tuning for each
subject and sets a new SoTA both quantitatively and in human trials. For more
information, please visit our project's website:
https://stellar-gen-ai.github.io/.
Related papers
- SDFD: Building a Versatile Synthetic Face Image Dataset with Diverse Attributes [14.966767182001755]
We propose a methodology for generating synthetic face image datasets that capture a broader spectrum of facial diversity.
Specifically, our approach integrates not only demographics and biometrics but also non-permanent traits like make-up, hairstyle, and accessories.
These prompts guide a state-of-the-art text-to-image model in generating a comprehensive dataset of high-quality realistic images.
arXiv Detail & Related papers (2024-04-26T08:51:31Z)
- InstructBooth: Instruction-following Personalized Text-to-Image Generation [30.89054609185801]
InstructBooth is a novel method designed to enhance image-text alignment in personalized text-to-image models.
Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier.
After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment.
arXiv Detail & Related papers (2023-12-04T20:34:46Z)
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion [34.662798793560995]
We present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding.
Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text.
arXiv Detail & Related papers (2023-03-15T17:07:45Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach for PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.