Populate-A-Scene: Affordance-Aware Human Video Generation
- URL: http://arxiv.org/abs/2507.00334v1
- Date: Tue, 01 Jul 2025 00:21:24 GMT
- Title: Populate-A-Scene: Affordance-Aware Human Video Generation
- Authors: Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang,
- Abstract summary: We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. We fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
- Score: 31.083046400077176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
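The cross-attention heatmap study is only described at a high level here. As a rough illustration, below is a minimal, hypothetical PyTorch sketch of how per-token cross-attention maps could be pulled out of a transformer-style text-to-video backbone; the toy CrossAttention module, the token_heatmap helper, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical probing sketch: per-token cross-attention heatmaps in a
# toy text-to-video cross-attention layer. Module, shapes, and helper
# names are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Toy single-head cross-attention: video tokens attend to text tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, text):  # x: (B, N_vid, D), text: (B, N_txt, D)
        attn = torch.softmax(
            self.q(x) @ self.k(text).transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1
        )  # (B, N_vid, N_txt): attention each video token pays to each text token
        self.last_attn = attn.detach()  # stash the map so it can be inspected later
        return attn @ self.v(text)

def token_heatmap(attn, token_idx, frames, height, width):
    """Reshape the attention mass assigned to one prompt token into per-frame maps."""
    maps = attn[:, :, token_idx]                     # (B, N_vid)
    return maps.reshape(-1, frames, height, width)   # (B, T, H, W)

# Usage with assumed latent sizes: 4 frames of 16x16 video tokens, 8 text tokens.
B, T, H, W, N_txt, D = 1, 4, 16, 16, 8, 64
layer = CrossAttention(D)
_ = layer(torch.randn(B, T * H * W, D), torch.randn(B, N_txt, D))
heat = token_heatmap(layer.last_attn, token_idx=3, frames=T, height=H, width=W)
print(heat.shape)  # torch.Size([1, 4, 16, 16]): where the video attends to, e.g., a "person" token
```

In a real backbone one would presumably register forward hooks on every cross-attention layer and average the maps over heads, layers, and denoising steps before upsampling them to frame resolution.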
Related papers
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis.
Our key insight is to distill human-scene interactions from state-of-the-art video generation models.
ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects.
arXiv Detail & Related papers (2024-12-24T18:55:38Z) - FIction: 4D Future Interaction Prediction from Video [63.37136159797888]
We introduce FIction for 4D future interaction prediction from videos.
Given an input video of a human activity, the goal is to predict which objects the person will interact with, and at what 3D locations, in the next time period.
arXiv Detail & Related papers (2024-12-01T18:44:17Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - PixelHuman: Animatable Neural Radiance Fields from Few Images [27.932366091437103]
We propose PixelHuman, a novel rendering model that generates animatable human scenes from a few images of a person.
Our method differs from existing methods in that it can generalize to any input image for animatable human synthesis.
Our experiments show that our method achieves state-of-the-art performance in multiview and novel pose synthesis from few-shot images.
arXiv Detail & Related papers (2023-07-18T08:41:17Z) - Putting People in Their Place: Affordance-Aware Human Insertion into Scenes [61.63825003487104]
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes.
Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances.
Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition.
arXiv Detail & Related papers (2023-04-27T17:59:58Z) - Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z) - NeuMan: Neural Human Radiance Field from a Single Video [26.7471970027198]
We train two NeRF models: a human NeRF model and a scene NeRF model (a minimal compositing sketch appears after this list).
Our method is able to learn subject-specific details, including cloth wrinkles and accessories, from just a 10-second video clip.
arXiv Detail & Related papers (2022-03-23T17:35:50Z) - Hallucinating Pose-Compatible Scenes [55.064949607528405]
We present a large-scale generative adversarial network for pose-conditioned scene generation.
We curate a massive meta-dataset containing over 19 million frames of humans in everyday environments.
We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose.
arXiv Detail & Related papers (2021-12-13T18:59:26Z) - Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)
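As referenced in the NeuMan entry above, the following is a minimal, hypothetical PyTorch sketch of how a separate human radiance field and scene radiance field could be queried and composited along a single ray with standard volume rendering. The TinyNeRF MLPs, the additive density blend, and all hyperparameters are assumptions for illustration, not NeuMan's actual implementation.

```python
# Hypothetical sketch: compositing a "human" and a "scene" radiance field
# along one ray with standard volume rendering. The tiny MLPs, sampling
# scheme, and additive density blend are illustrative assumptions.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Maps a 3D point to (rgb, density)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, pts):              # pts: (N, 3)
        out = self.net(pts)
        rgb = torch.sigmoid(out[:, :3])  # (N, 3) colors in [0, 1]
        sigma = torch.relu(out[:, 3])    # (N,) non-negative densities
        return rgb, sigma

def render_ray(fields, origin, direction, n_samples=64, near=0.1, far=4.0):
    """Query both fields at shared samples, blend, then alpha-composite."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction          # (N, 3) sample positions
    rgb_s, sig_s = fields["scene"](pts)
    rgb_h, sig_h = fields["human"](pts)
    sigma = sig_s + sig_h                          # densities add where both fields overlap
    rgb = (sig_s[:, None] * rgb_s + sig_h[:, None] * rgb_h) / (sigma[:, None] + 1e-8)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma * delta)        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-8])[:-1], dim=0)
    weights = alpha * trans                        # contribution of each sample to the pixel
    return (weights[:, None] * rgb).sum(dim=0)     # final (3,) pixel color

fields = {"scene": TinyNeRF(), "human": TinyNeRF()}
color = render_ray(fields, origin=torch.zeros(3), direction=torch.tensor([0.0, 0.0, 1.0]))
print(color.shape)  # torch.Size([3])
```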
This list is automatically generated from the titles and abstracts of the papers on this site.