ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
- URL: http://arxiv.org/abs/2509.07920v1
- Date: Tue, 09 Sep 2025 17:00:42 GMT
- Title: ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
- Authors: Ao Li, Jinpeng Liu, Yixuan Zhu, Yansong Tang,
- Abstract summary: We introduce ScoreHOI, an effective diffusion-based model for joint human-object interaction reconstruction.<n>During inference, the ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints.<n>We propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy.
- Score: 40.82318436286586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, the ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI's superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
Related papers
- On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction [5.796028565806211]
We propose a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model.<n>A contrastive objective aligns the reconstructed representation with the target semantic distribution.<n>Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality.
arXiv Detail & Related papers (2025-11-24T19:15:47Z) - ECHO: Ego-Centric modeling of Human-Object interactions [71.17118015822699]
ECHO (Ego-Centric modeling of Human-Object interactions) is developed.<n>It recovers three modalities: human pose, object motion, and contact from such minimal observation.<n>It outperforms existing methods that do not offer the same flexibility.
arXiv Detail & Related papers (2025-08-29T12:12:22Z) - Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images.<n>We use hand-object interaction as geometric guidance to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z) - Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos.<n>We find that human appearance can provide a straightforward cue to address these obstacles.<n>We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures.<n>Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts.<n>We propose a dual-stage Foundation-to-Diffusion framework that precisely align 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - Occluded Human Pose Estimation based on Limb Joint Augmentation [14.36131862057872]
We propose an occluded human pose estimation framework based on limb joint augmentation to enhance the generalization ability of the pose estimation model on the occluded human bodies.
To further enhance the localization ability of the model, this paper constructs a dynamic structure loss function based on limb graphs to explore the distribution of occluded joints.
arXiv Detail & Related papers (2024-10-13T15:48:24Z) - CausalDialogue: Modeling Utterance-level Causality in Conversations [83.03604651485327]
We have compiled and expanded upon a new dataset called CausalDialogue through crowd-sourcing.
This dataset includes multiple cause-effect pairs within a directed acyclic graph (DAG) structure.
We propose a causality-enhanced method called Exponential Average Treatment Effect (ExMATE) to enhance the impact of causality at the utterance level in training neural conversation models.
arXiv Detail & Related papers (2022-12-20T18:31:50Z) - Forget-me-not! Contrastive Critics for Mitigating Posterior Collapse [20.258298183228824]
We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations.
This approach is straightforward to implement and requires significantly less training time than prior methods.
arXiv Detail & Related papers (2022-07-19T20:07:17Z) - Real-time landmark detection for precise endoscopic submucosal
dissection via shape-aware relation network [51.44506007844284]
We propose a shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection surgery.
We first devise an algorithm to automatically generate relation keypoint heatmaps, which intuitively represent the prior knowledge of spatial relations among landmarks.
We then develop two complementary regularization schemes to progressively incorporate the prior knowledge into the training process.
arXiv Detail & Related papers (2021-11-08T07:57:30Z) - Skeleton-Based Mutually Assisted Interacted Object Localization and
Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z) - RobustFusion: Robust Volumetric Performance Reconstruction under
Human-object Interactions from Monocular RGBD Stream [27.600873320989276]
High-quality 4D reconstruction of human performance with complex interactions to various objects is essential in real-world scenarios.
Recent advances still fail to provide reliable performance reconstruction.
We propose RobustFusion, a robust volumetric performance reconstruction system for human-object interaction scenarios.
arXiv Detail & Related papers (2021-04-30T08:41:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.