Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
- URL: http://arxiv.org/abs/2507.02565v1
- Date: Thu, 03 Jul 2025 12:19:26 GMT
- Title: Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
- Authors: Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen, Yangang Wang, Gim Hee Lee,
- Abstract summary: Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
- Score: 50.76723760768117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.
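The core idea of the abstract's dual-branch optimization — jointly updating per-person optimizable tensors under a data-fitting (2D keypoint) term and a penetration penalty — can be illustrated with a deliberately minimal toy in plain Python. This is not the authors' implementation: bodies are reduced to 2D circle centers, the keypoint loss to a squared distance to an observation, and the mesh-penetration constraint to a quadratic circle-overlap penalty; all names, radii, and weights are illustrative assumptions.

```python
import math

def optimize_pair(obs_a, obs_b, radius=0.5, w_pen=10.0, lr=0.05, steps=500):
    """Toy two-branch optimization: fit two 2D 'body centers' to their
    observations while penalizing interpenetration (distance < 2*radius).
    Minimizes  ||xa-obs_a||^2 + ||xb-obs_b||^2 + (w_pen/2)*overlap^2
    by plain gradient descent."""
    xa = [0.0, 0.0]  # optimizable tensor, person A
    xb = [0.0, 0.0]  # optimizable tensor, person B
    for _ in range(steps):
        # data (keypoint) gradients: pull each center toward its observation
        ga = [2.0 * (xa[i] - obs_a[i]) for i in range(2)]
        gb = [2.0 * (xb[i] - obs_b[i]) for i in range(2)]
        # penetration gradient: push the centers apart when they overlap
        d = [xa[i] - xb[i] for i in range(2)]
        dist = math.hypot(d[0], d[1]) + 1e-8
        overlap = max(0.0, 2.0 * radius - dist)
        if overlap > 0.0:
            for i in range(2):
                push = w_pen * overlap * d[i] / dist
                ga[i] -= push
                gb[i] += push
        for i in range(2):
            xa[i] -= lr * ga[i]
            xb[i] -= lr * gb[i]
    return xa, xb

# Observations closer than 2*radius: the optimum compromises between
# matching the data and keeping the bodies from interpenetrating.
xa, xb = optimize_pair((0.0, 0.0), (0.6, 0.0))
```

In the paper's actual framework the data term spans 3D Gaussians and 2D keypoints, the penetration term acts on body meshes, and a learned diffusion prior over proxemic behavior regularizes the poses; the toy only shows how competing constraints shape a joint two-person solution.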
Related papers
- Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting [4.568580817155409]
Amodal completion is crucial for understanding human-object interactions in computer vision and robotics. We develop a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios.
arXiv Detail & Related papers (2025-08-01T08:33:45Z) - HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly [15.347208661111198]
HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content. HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
arXiv Detail & Related papers (2025-07-26T12:03:47Z) - Pose Priors from Language Models [74.61186408764559]
Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses.
arXiv Detail & Related papers (2024-05-06T17:59:36Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z) - Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views [32.940614931864154]
We propose a scene-conditioned diffusion method to model the body pose distribution.
The method generates bodies in plausible human-scene interactions.
It achieves superior accuracy for visible joints and diversity for invisible body parts.
arXiv Detail & Related papers (2023-04-12T17:58:57Z) - Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation [33.86986028882488]
Occlusion poses a great threat to monocular multi-person 3D human pose estimation due to large variability in terms of the shape, appearance, and position of occluders.
Existing methods try to handle occlusion with pose priors/constraints, data augmentation, or implicit reasoning.
We develop a method to explicitly model this process that significantly improves bottom-up multi-person human pose estimation.
arXiv Detail & Related papers (2022-07-29T22:12:50Z) - Occluded Human Body Capture with Self-Supervised Spatial-Temporal Motion Prior [7.157324258813676]
We build the first 3D occluded motion dataset (OcMotion), which can be used for both training and testing.
A spatial-temporal layer is then designed to learn joint-level correlations.
Experimental results show that our method can generate accurate and coherent human motions from occluded videos with good generalization ability and runtime efficiency.
arXiv Detail & Related papers (2022-07-12T08:15:11Z) - LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies [78.17425779503047]
We propose a novel neural implicit representation for the human body.
It is fully differentiable and optimizable with disentangled shape and pose latent spaces.
Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses.
arXiv Detail & Related papers (2021-11-30T04:10:57Z) - DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.