Occlusion-Aware Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction
- URL: http://arxiv.org/abs/2507.08137v3
- Date: Sun, 14 Sep 2025 03:40:57 GMT
- Title: Occlusion-Aware Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction
- Authors: Hyungjun Doh, Dong In Lee, Seunggeun Chi, Pin-Hao Huang, Kwonjoon Lee, Sangpil Kim, Karthik Ramani,
- Abstract summary: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video.<n>Our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions.<n>We validate our approach using 3D Gaussian Splatting on challenging monocular videos.
- Score: 29.807994561746437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated-particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.
Related papers
- MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments [12.796165448365949]
MOSAIC-GS is a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos.<n>We leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking.<n>We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods.
arXiv Detail & Related papers (2026-01-08T20:48:24Z) - InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting [64.42884719282323]
InpaintHuman is a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos.<n>Our approach employs direct pixel-level supervision to ensure identity fidelity.
arXiv Detail & Related papers (2026-01-05T13:26:02Z) - IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes [25.939318593012484]
Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation.<n>We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories.<n>Our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining.
arXiv Detail & Related papers (2025-11-24T15:48:08Z) - Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images.<n>We use hand-object interaction as geometric guidance to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z) - SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition [14.381223353489062]
We propose textbfSplitGaussian, a novel framework that explicitly decomposes scene representations into static and dynamic components.<n>SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.
arXiv Detail & Related papers (2025-08-06T09:00:13Z) - VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling [68.65587507038539]
We present a novel video diffusion-enhanced 4D Gaussian Splatting framework for dynamic urban scene modeling.<n>Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model.<n>Our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB.
arXiv Detail & Related papers (2025-08-04T07:24:05Z) - GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent.<n>Existing approaches, depending on the domain they operate, suffer from several issues that degrade the user experience.<n>We introduce textbfGaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z) - STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering [15.873329633980015]
Existing 3DGS-based methods for dynamic reconstruction often suffer from textbfSTDR (Spatio-coupling DeTemporal for Real-time rendering)<n>We propose textbfSTDR (Spatio-coupling DeTemporal for Real-time rendering), a plug-and-play module learns thattemporal probability distributions for each scene.
arXiv Detail & Related papers (2025-05-28T14:26:41Z) - Temporally Consistent Object-Centric Learning by Contrasting Slots [23.203973564679508]
We introduce a novel object-level temporal contrastive loss for video object-centric models.<n>Our method significantly improves the temporal consistency of the learned object-centric representations.
arXiv Detail & Related papers (2024-12-18T19:46:04Z) - Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos [101.48581851337703]
We present BTimer, the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes.<n>Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ('bullet') timestamp by aggregating information from all the context frames.<n>Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets.
arXiv Detail & Related papers (2024-12-04T18:15:06Z) - Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction [50.873820265165975]
We introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction.<n>We propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling.<n>We contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T08:23:38Z) - TFS-NeRF: Template-Free NeRF for Semantic 3D Reconstruction of Dynamic Scene [25.164085646259856]
This paper introduces a template-free 3D semantic NeRF for dynamic scenes captured from sparse or singleview RGB videos.<n>By disentangling the motions of interacting entities and optimizing per-entity skinning weights, our method efficiently generates accurate, semantically separable geometries.
arXiv Detail & Related papers (2024-09-26T01:34:42Z) - Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases.
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise.<n>We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components.<n>Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D
Human Motion Recovery from Monocular Videos [5.258814754543826]
We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2023-11-20T10:53:59Z) - SCFusion: Real-time Incremental Scene Reconstruction with Semantic
Completion [86.77318031029404]
We propose a framework that performs scene reconstruction and semantic scene completion jointly in an incremental and real-time manner.
Our framework relies on a novel neural architecture designed to process occupancy maps and leverages voxel states to accurately and efficiently fuse semantic completion with the 3D global model.
arXiv Detail & Related papers (2020-10-26T15:31:52Z) - Consistency Guided Scene Flow Estimation [159.24395181068218]
CGSF is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video.
We show that the proposed model can reliably predict disparity and scene flow in challenging imagery.
It achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains.
arXiv Detail & Related papers (2020-06-19T17:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.