Related papers: Text2Place: Affordance-aware Text Guided Human Placement

URL: http://arxiv.org/abs/2407.15446v1
Date: Mon, 22 Jul 2024 08:00:06 GMT
Title: Text2Place: Affordance-aware Text Guided Human Placement
Authors: Rishubh Parihar, Harsh Gupta, Sachidanand VS, R. Venkatesh Babu
Abstract summary: This work tackles the problem of realistic human insertion into a given background scene, termed Semantic Human Placement. For learning semantic masks, we leverage rich object-scene priors learned from text-to-image generative models. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity.
Score: 26.041917073228483
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: For a given scene, humans can easily reason about the locations and poses in which to place objects. Designing a computational model to reason about these affordances poses a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion into a given background scene, termed Semantic Human Placement. This task is extremely challenging given the diverse backgrounds, the scale and pose of the generated person, and, finally, the identity preservation of the person. We divide the problem into the following two stages: i) learning semantic masks using text guidance for localizing regions in the image to place humans, and ii) subject-conditioned inpainting to place a given subject adhering to the scene affordance within the semantic masks. For learning semantic masks, we leverage rich object-scene priors learned from text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first to provide an effective solution for realistic human placement in diverse real-world scenes. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity. Further, we present results for several downstream tasks: scene hallucination from a single or multiple generated persons and text-based attribute editing. With extensive comparisons against strong baselines, we show the superiority of our method in realistic human placement.

Related papers

Learning Complex Non-Rigid Image Edits from Multimodal Conditioning [18.500715348636582]
We focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural looking images while being highly controllable with text and pose. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects.
arXiv Detail & Related papers (2024-12-13T15:41:08Z)
Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. We propose a new approach that decomposes the complex problem into two more manageable sub-problems. For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z)
FlashFace: Human Image Personalization with High-fidelity Identity Preservation [59.76645602354481]
FlashFace allows users to easily personalize their own photos by providing one or a few reference face images and a text prompt. Our approach is distinguishable from existing human photo customization methods by higher-fidelity identity preservation and better instruction following.
arXiv Detail & Related papers (2024-03-25T17:59:57Z)
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language. Our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
Putting People in Their Place: Affordance-Aware Human Insertion into Scenes [61.63825003487104]
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition.
arXiv Detail & Related papers (2023-04-27T17:59:58Z)
Global Context-Aware Person Image Generation [24.317541784957285]
We propose a data-driven approach for context-aware person image generation. In our method, the position, scale, and appearance of the generated person are semantically conditioned on the existing persons in the scene.
arXiv Detail & Related papers (2023-02-28T16:34:55Z)
HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on. We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
Long-term Human Motion Prediction with Scene Context [60.096118270451974]
We propose a novel three-stage framework for predicting human motion. Our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path.
arXiv Detail & Related papers (2020-07-07T17:59:53Z)
Wish You Were Here: Context-Aware Human Generation [100.51309746913512]
We present a novel method for inserting objects, specifically humans, into existing images. Our method involves three networks: the first generates the semantic map of the new person, given the pose of the other persons in the scene. The second network renders the pixels of the novel person and its blending mask, based on specifications in the form of multiple appearance components. A third network refines the generated face in order to match that of the target person.
arXiv Detail & Related papers (2020-05-21T14:09:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
