Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
- URL: http://arxiv.org/abs/2508.00427v1
- Date: Fri, 01 Aug 2025 08:33:45 GMT
- Title: Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
- Authors: Seunggeun Chi, Enna Sachdeva, Pin-Hao Huang, Kwonjoon Lee
- Abstract summary: Amodal completion is crucial for understanding human-object interactions in computer vision and robotics. We develop a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios.
- Score: 4.568580817155409
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Amodal completion, the process of inferring the full appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, such as those that use pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To address this problem, we develop a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method uses customized denoising strategies across these regions within a diffusion model, improving the accuracy and realism of the generated completions in both shape and visual detail. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a more human-like understanding of dynamic environments. We also show that our pipeline is robust even without ground-truth contact annotations, which broadens its applicability to tasks like 3D reconstruction and novel view/pose synthesis.
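The primary/secondary split and region-dependent denoising lend themselves to a compact illustration. Below is a minimal PyTorch sketch of one plausible reading of the pipeline, not the authors' released code: contact is approximated as the band where the dilated human and object masks overlap (mirroring the annotation-free setting), the primary/secondary split follows from proximity to that band, and a RePaint-style masked diffusion loop stands in for the customized denoising, with a full stochastic refresh in the primary region and damped noise injection in the secondary region. Every name here (`build_regions`, `multi_region_inpaint`, `secondary_damping`, the mask shapes) is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def dilate(mask: torch.Tensor, radius: int) -> torch.Tensor:
    """Cheap binary dilation via max-pooling; mask is (1, 1, H, W) bool."""
    k = 2 * radius + 1
    return F.max_pool2d(mask.float(), kernel_size=k, stride=1, padding=radius).bool()

def build_regions(human_mask, object_mask, contact_radius=5, primary_radius=20):
    """Split the human-occluded area into a primary region (near inferred
    contacts, where hidden object parts most likely are) and a secondary
    region (occlusion less probable). Contact is approximated as the band
    where the dilated human and object masks overlap -- an assumption that
    stands in for ground-truth contact annotations."""
    contact = dilate(human_mask, contact_radius) & dilate(object_mask, contact_radius)
    near_contact = dilate(contact, primary_radius)
    primary = human_mask & near_contact
    secondary = human_mask & ~near_contact
    return primary, secondary

@torch.no_grad()
def multi_region_inpaint(eps_model, alpha_bar, image, visible_mask,
                         secondary, secondary_damping=0.5):
    """RePaint-style masked diffusion with per-region noise treatment.
    image: (1, 3, H, W) in [-1, 1]; masks: (1, 1, H, W) bool;
    alpha_bar: (T,) cumulative noise schedule; eps_model(x, t) predicts noise.
    The primary region (occluded pixels outside `secondary`) gets the
    default full stochastic refresh each step."""
    x = torch.randn_like(image)
    for t in reversed(range(len(alpha_bar))):
        a = alpha_bar[t]
        # Clamp visible pixels to a correspondingly-noised copy of the input,
        # so synthesis happens only inside the occluded regions.
        noised_known = a.sqrt() * image + (1 - a).sqrt() * torch.randn_like(image)
        x = torch.where(visible_mask, noised_known, x)
        # One reverse step from the model's noise prediction.
        eps = eps_model(x, t)
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()
        # Damp injected noise in the secondary region so it stays closer to
        # its running estimate; the primary region is fully re-explored.
        noise = torch.randn_like(x)
        noise = torch.where(secondary, secondary_damping * noise, noise)
        a_prev = alpha_bar[t - 1] if t > 0 else alpha_bar.new_ones(())
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * noise
    # Final composite: keep the observed pixels verbatim.
    return torch.where(visible_mask, image, x)
```

Under this reading, one would call `primary, secondary = build_regions(human_mask, object_mask)` and then inpaint with `visible_mask = ~human_mask`; `secondary_damping < 1` encodes the prior that little of the object hides there. The paper's actual per-region denoising strategies and its use of human topology may differ.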
Related papers
- Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning [64.32618490065117]
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. We propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
arXiv Detail & Related papers (2025-08-02T04:14:18Z)
- Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation [55.2480439325792]
Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively.
arXiv Detail & Related papers (2025-07-30T19:43:47Z)
- Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z)
- Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes. We show it outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z)
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models [8.933560282929726]
We introduce a novel affordance representation, named Comprehensive Affordance (ComA).
Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes; a rough sketch of this statistic appears after this list.
We demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance.
arXiv Detail & Related papers (2024-01-23T18:59:59Z)
- ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection [70.11264880907652]
Recent camouflaged object detection (COD) methods attempt to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios.
We propose an effective unified collaborative pyramid network that mimics human behavior when observing vague and camouflaged images, zooming in and out.
Our framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.
arXiv Detail & Related papers (2023-10-31T06:11:23Z)
- ArK: Augmented Reality with Knowledge Interactive Emergent Ability [115.72679420999535]
We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains.
The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK).
We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
arXiv Detail & Related papers (2023-05-01T17:57:01Z)
- Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
- Effective Actor-centric Human-object Interaction Detection [20.564689533862524]
We propose a novel actor-centric framework to detect Human-Object Interaction in images.
Our method achieves state-of-the-art performance on the challenging V-COCO and HICO-DET benchmarks.
arXiv Detail & Related papers (2022-02-24T10:24:44Z)
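As a rough companion to the ComA entry above (our reading of its abstract, not the authors' released code), the statistic it describes can be sketched as pooling proximity and relative-orientation measurements between object vertices and nearby human-mesh vertices into a normalized joint histogram; `coma_histogram`, its inputs, and the binning below are all illustrative assumptions.

```python
import numpy as np

def coma_histogram(obj_verts, obj_normals, human_verts, human_normals,
                   max_dist=0.5, bins=(16, 16)):
    """Pool per-vertex-pair proximity and relative orientation into a
    normalized 2D histogram. Inputs are (N, 3) / (M, 3) float arrays of
    vertex positions and unit normals (hypothetical interface)."""
    # Pairwise distances between object and human vertices: (N, M).
    d = np.linalg.norm(obj_verts[:, None, :] - human_verts[None, :, :], axis=-1)
    # Relative orientation as the angle between surface normals: (N, M).
    cos = np.einsum('nd,md->nm', obj_normals, human_normals).clip(-1.0, 1.0)
    ang = np.arccos(cos)
    keep = d < max_dist  # only human vertices near the object contribute
    hist, _, _ = np.histogram2d(d[keep], ang[keep], bins=bins,
                                range=[[0.0, max_dist], [0.0, np.pi]])
    return hist / max(hist.sum(), 1.0)  # joint distribution over (dist, angle)
```

Aggregating such histograms over many interacting human meshes would yield a contact-based affordance distribution per object, in the spirit of that entry's description.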
This list is automatically generated from the titles and abstracts of the papers on this site.