Learning Complex Non-Rigid Image Edits from Multimodal Conditioning
- URL: http://arxiv.org/abs/2412.10219v1
- Date: Fri, 13 Dec 2024 15:41:08 GMT
- Title: Learning Complex Non-Rigid Image Edits from Multimodal Conditioning
- Authors: Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa
- Abstract summary: We focus on inserting a given human (specifically, a single image of a person) into a novel scene.
Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose.
We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects.
- Score: 18.500715348636582
- License:
- Abstract: In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images, the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally, we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following these criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions with robust 2D pose improves the quality of person-object interactions.
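The abstract describes building training pairs from two frames of the same video, one serving as the reference and one as the target. The following is a minimal sketch of how such pairs of frame indices might be sampled. It is not the authors' pipeline: the function name `sample_frame_pairs`, the gap limits, and the uniform sampling strategy are all assumptions made for illustration, and the pose extraction and LLM captioning steps are not shown.

```python
import random
from typing import List, Tuple


def sample_frame_pairs(
    num_frames: int,
    min_gap: int,
    max_gap: int,
    num_pairs: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Sample (reference, target) frame index pairs from one video.

    Hypothetical helper: the target frame lies between min_gap and
    max_gap frames after the reference, so the pose has had time to change.
    """
    if min_gap < 1 or max_gap < min_gap:
        raise ValueError("require 1 <= min_gap <= max_gap")
    if num_frames <= min_gap:
        # Video too short to yield any pair with the requested gap.
        return []

    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Leave room for at least min_gap frames after the reference.
        ref = rng.randrange(0, num_frames - min_gap)
        # Clamp the largest allowed gap to the end of the video.
        gap = rng.randint(min_gap, min(max_gap, num_frames - 1 - ref))
        pairs.append((ref, ref + gap))
    return pairs


if __name__ == "__main__":
    for ref, target in sample_frame_pairs(300, min_gap=10, max_gap=60, num_pairs=3):
        print(f"reference frame {ref} -> target frame {target}")
```

In a setup like the one the abstract outlines, each sampled pair would then be passed to a pose estimator and a captioning model to produce the pose and text conditioning.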
Related papers
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
We showcase the potential of such an embroidered pose representation for (1) SMPL regression from image with optional text cue; and (2) on the task of fine-grained instruction generation.
arXiv Detail & Related papers (2024-09-10T14:09:39Z)
- Text2Place: Affordance-aware Text Guided Human Placement [26.041917073228483]
This work tackles the problem of realistic human insertion in a given background scene, termed Semantic Human Placement.
For learning semantic masks, we leverage rich object-scene priors learned from the text-to-image generative models.
The proposed method can generate highly realistic scene compositions while preserving the background and subject identity.
arXiv Detail & Related papers (2024-07-22T08:00:06Z)
- UniHuman: A Unified Model for Editing Human Images in the Wild [49.896715833075106]
We propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings.
To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders.
In user studies, UniHuman is preferred by the users in an average of 77% of cases.
arXiv Detail & Related papers (2023-12-22T05:00:30Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Pose Guided Multi-person Image Generation From Text [15.15576618501609]
Existing methods struggle to create high fidelity full-body images, especially multiple people.
We propose a pose-guided text-to-image model, using pose as an additional input constraint.
We show results on the Deepfashion dataset and create a new multi-person Deepfashion dataset to demonstrate the multi-capabilities of our approach.
arXiv Detail & Related papers (2022-03-09T17:38:03Z)
- Hallucinating Pose-Compatible Scenes [55.064949607528405]
We present a large-scale generative adversarial network for pose-conditioned scene generation.
We curate a massive meta-dataset containing over 19 million frames of humans in everyday environments.
We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose.
arXiv Detail & Related papers (2021-12-13T18:59:26Z)
- Who's Waldo? Linking People Across Text and Images [56.40556801773923]
We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
arXiv Detail & Related papers (2021-08-16T17:36:49Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Wish You Were Here: Context-Aware Human Generation [100.51309746913512]
We present a novel method for inserting objects, specifically humans, into existing images.
Our method involves three subnetworks: the first generates the semantic map of the new person, given the pose of the other persons in the scene.
The second network renders the pixels of the novel person and its blending mask, based on specifications in the form of multiple appearance components.
A third network refines the generated face in order to match that of the target person.
arXiv Detail & Related papers (2020-05-21T14:09:14Z)