Populate-A-Scene: Affordance-Aware Human Video Generation
- URL: http://arxiv.org/abs/2507.00334v1
- Date: Tue, 01 Jul 2025 00:21:24 GMT
- Title: Populate-A-Scene: Affordance-Aware Human Video Generation
- Authors: Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang,
- Abstract summary: We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. We fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
- Score: 31.083046400077176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
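The cross-attention heatmap study is only described at a high level here. As a rough illustration, below is a minimal, hypothetical PyTorch sketch of how per-token cross-attention maps could be pulled out of a transformer-style text-to-video backbone; the toy CrossAttention module, the token_heatmap helper, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical probing sketch: per-token cross-attention heatmaps in a
# toy text-to-video cross-attention layer. Module, shapes, and helper
# names are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Toy single-head cross-attention: video tokens attend to text tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, text):  # x: (B, N_vid, D), text: (B, N_txt, D)
        attn = torch.softmax(
            self.q(x) @ self.k(text).transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1
        )  # (B, N_vid, N_txt): attention each video token pays to each text token
        self.last_attn = attn.detach()  # stash the map so it can be inspected later
        return attn @ self.v(text)

def token_heatmap(attn, token_idx, frames, height, width):
    """Reshape the attention mass assigned to one prompt token into per-frame maps."""
    maps = attn[:, :, token_idx]                     # (B, N_vid)
    return maps.reshape(-1, frames, height, width)   # (B, T, H, W)

# Usage with assumed latent sizes: 4 frames of 16x16 video tokens, 8 text tokens.
B, T, H, W, N_txt, D = 1, 4, 16, 16, 8, 64
layer = CrossAttention(D)
_ = layer(torch.randn(B, T * H * W, D), torch.randn(B, N_txt, D))
heat = token_heatmap(layer.last_attn, token_idx=3, frames=T, height=H, width=W)
print(heat.shape)  # torch.Size([1, 4, 16, 16]): where the video attends to, e.g., a "person" token
```

In a real backbone one would presumably register forward hooks on every cross-attention layer and average the maps over heads, layers, and denoising steps before upsampling them to frame resolution.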
Related papers
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis.
Our key insight is to distill human-scene interactions from state-of-the-art video generation models.
ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects.
arXiv Detail & Related papers (2024-12-24T18:55:38Z) - FIction: 4D Future Interaction Prediction from Video [63.37136159797888]
We introduce FIction for 4D future interaction prediction from videos.
Given an input video of a human activity, the goal is to predict which objects the person will interact with, and at what 3D locations, in the next time period.
arXiv Detail & Related papers (2024-12-01T18:44:17Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - PixelHuman: Animatable Neural Radiance Fields from Few Images [27.932366091437103]
We propose PixelHuman, a novel rendering model that generates animatable human scenes from a few images of a person.
Our method differs from existing methods in that it can generalize to any input image for animatable human synthesis.
Our experiments show that our method achieves state-of-the-art performance in multiview and novel pose synthesis from few-shot images.
arXiv Detail & Related papers (2023-07-18T08:41:17Z) - Putting People in Their Place: Affordance-Aware Human Insertion into Scenes [61.63825003487104]
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes.
Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances.
Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition.
arXiv Detail & Related papers (2023-04-27T17:59:58Z) - Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z) - NeuMan: Neural Human Radiance Field from a Single Video [26.7471970027198]
We train two NeRF models: a human NeRF model and a scene NeRF model (a minimal compositing sketch appears after this list).
Our method is able to learn subject-specific details, including cloth wrinkles and accessories, from just a 10-second video clip.
arXiv Detail & Related papers (2022-03-23T17:35:50Z) - Hallucinating Pose-Compatible Scenes [55.064949607528405]
We present a large-scale generative adversarial network for pose-conditioned scene generation.
We curate a massive meta-dataset containing over 19 million frames of humans in everyday environments.
We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose.
arXiv Detail & Related papers (2021-12-13T18:59:26Z) - Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)
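As referenced in the NeuMan entry above, the following is a minimal, hypothetical PyTorch sketch of how a separate human radiance field and scene radiance field could be queried and composited along a single ray with standard volume rendering. The TinyNeRF MLPs, the additive density blend, and all hyperparameters are assumptions for illustration, not NeuMan's actual implementation.

```python
# Hypothetical sketch: compositing a "human" and a "scene" radiance field
# along one ray with standard volume rendering. The tiny MLPs, sampling
# scheme, and additive density blend are illustrative assumptions.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Maps a 3D point to (rgb, density)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, pts):              # pts: (N, 3)
        out = self.net(pts)
        rgb = torch.sigmoid(out[:, :3])  # (N, 3) colors in [0, 1]
        sigma = torch.relu(out[:, 3])    # (N,) non-negative densities
        return rgb, sigma

def render_ray(fields, origin, direction, n_samples=64, near=0.1, far=4.0):
    """Query both fields at shared samples, blend, then alpha-composite."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction          # (N, 3) sample positions
    rgb_s, sig_s = fields["scene"](pts)
    rgb_h, sig_h = fields["human"](pts)
    sigma = sig_s + sig_h                          # densities add where both fields overlap
    rgb = (sig_s[:, None] * rgb_s + sig_h[:, None] * rgb_h) / (sigma[:, None] + 1e-8)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma * delta)        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-8])[:-1], dim=0)
    weights = alpha * trans                        # contribution of each sample to the pixel
    return (weights[:, None] * rgb).sum(dim=0)     # final (3,) pixel color

fields = {"scene": TinyNeRF(), "human": TinyNeRF()}
color = render_ray(fields, origin=torch.zeros(3), direction=torch.tensor([0.0, 0.0, 1.0]))
print(color.shape)  # torch.Size([3])
```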
This list is automatically generated from the titles and abstracts of the papers on this site.