FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
Abstract Overview
FunREC is a training-free, optimization-based system that reconstructs functional 3D digital twins of indoor scenes from a single egocentric RGB-D interaction video. The method segments the video into static and dynamic fragments, discovers articulated parts, estimates their kinematic parameters (revolute or prismatic joints) and per-frame poses, and reconstructs both static geometry and moving parts in canonical space using TSDF fusion. It integrates geometric reasoning with semantic and motion priors from foundation models, including a video-language model for interaction detection, point trackers for sparse 3D trajectories, and SAM2 for dense part segmentation. The paper also introduces two new egocentric interaction datasets—RealFun4D (351 real interaction videos across 60 apartments) and OmniFun4D (127 photorealistic simulated sequences in 12 OmniGibson scenes)—to evaluate functional scene reconstruction.
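To make the two joint types concrete, the following is a minimal sketch of how a revolute or prismatic joint maps a point on a moving part from its canonical position to its position at a given joint state. It is not the paper's implementation: the function names, the axis/pivot parameterization, and the use of Rodrigues' rotation formula are illustrative assumptions.

```python
import math

# Hypothetical helpers for illustration; none of these names come from the paper.

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("joint axis must be non-zero")
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def revolute_transform(point, axis, pivot, angle):
    """Rotate `point` by `angle` radians about the line through `pivot` along `axis`."""
    k = _normalize(axis)
    p = tuple(x - o for x, o in zip(point, pivot))  # express point relative to the pivot
    c, s = math.cos(angle), math.sin(angle)
    kxp = _cross(k, p)
    kdp = _dot(k, p)
    # Rodrigues' formula: p*cos + (k x p)*sin + k*(k . p)*(1 - cos)
    r = tuple(p[i] * c + kxp[i] * s + k[i] * kdp * (1.0 - c) for i in range(3))
    return tuple(r[i] + pivot[i] for i in range(3))

def prismatic_transform(point, axis, displacement):
    """Translate `point` by `displacement` (metres) along the unit-normalized `axis`."""
    k = _normalize(axis)
    return tuple(point[i] + displacement * k[i] for i in range(3))

# Example: a door hinge along z at x = 1 and a drawer sliding along y.
door_corner = revolute_transform((2.0, 0.0, 0.0), (0, 0, 1), (1.0, 0.0, 0.0), math.pi / 2)
drawer_front = prismatic_transform((0.0, 0.0, 0.0), (0, 2, 0), 0.3)
```

Under this parameterization a revolute joint is described by an axis direction, a point on the axis, and a per-frame angle, while a prismatic joint needs only an axis direction and a per-frame displacement.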
Novelty
The primary contribution is scene-scale articulated reconstruction from in-the-wild egocentric interaction video without requiring controlled multi-state capture, CAD priors, pre-scanned object models, or any training. The work also introduces two new benchmarks (RealFun4D and OmniFun4D) specifically designed for evaluating functional 3D scene reconstruction from realistic human-scene interactions.
Results
Across OmniFun4D, HOI4D, and RealFun4D, FunREC outperforms all compared methods on four tasks: articulated motion estimation; moving-part segmentation (mIoU of 77.9, 76.4, and 74.8, respectively, versus next-best scores of 23.6–26.8); 6D part pose estimation (up to 79.43% ADD-S / 69.85% ADD, more than a twofold improvement over BundleSDF); and reconstruction quality (Chamfer Distance of 3.2 cm, 0.7 cm, and 6.1 cm). The system is further demonstrated on URDF/USD export to physics simulators, hand-guided affordance mapping, and robot-scene interaction from human demonstrations.
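For readers unfamiliar with the metrics named above, the sketch below shows common textbook definitions of IoU, symmetric Chamfer distance, and the ADD / ADD-S point errors. These are assumptions about the general form only: the exact variants used in the paper (for example, squared versus unsquared Chamfer distance, or the threshold that turns ADD errors into the reported percentages) are not stated in this summary, and all function names are illustrative.

```python
import math

def iou(pred, gt):
    """Intersection over union of two sets of pixel (or point) indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return len(pred & gt) / len(union)

def mean_iou(pairs):
    """Mean IoU over (prediction, ground-truth) pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

def _nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of the two directed mean nearest-neighbour distances."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)

def add_error(pred_pts, gt_pts):
    """ADD: mean distance between corresponding model points under predicted and true pose."""
    return sum(math.dist(p, g) for p, g in zip(pred_pts, gt_pts)) / len(gt_pts)

def add_s_error(pred_pts, gt_pts):
    """ADD-S: mean distance from each true-pose point to the closest predicted-pose point."""
    return sum(_nearest(g, pred_pts) for g in gt_pts) / len(gt_pts)

# Two model points whose correspondence is swapped by the predicted pose:
pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(add_error(pred, gt), add_s_error(pred, gt))
```

The final example illustrates why both pose metrics are usually reported together: ADD penalizes the swapped correspondence, whereas ADD-S, which is intended for symmetric objects, does not.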
Key Points
- FunREC reconstructs articulated indoor scenes from a single egocentric RGB-D interaction video by jointly estimating camera motion, part motion, articulation parameters (revolute and prismatic joints), and geometry through a training-free optimization pipeline leveraging foundation model priors.
- The authors introduce RealFun4D (351 real interaction videos across 60 apartments in four countries) and OmniFun4D (127 photorealistic simulated sequences in 12 OmniGibson scenes), two new datasets for evaluating functional 3D scene reconstruction.
- Experiments show large improvements over prior baselines, including a gain of more than 50 mIoU points in part segmentation and 5–10× lower articulation and pose errors. The resulting scene representations are simulation-compatible, as demonstrated through URDF/USD export, affordance mapping, and robotic interaction.
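As a rough illustration of what URDF export of a single estimated joint might involve, the sketch below writes one articulated part as a two-link URDF using Python's standard `xml.etree.ElementTree`. The element and attribute names follow the URDF format, but the function name, link names, and the `effort`/`velocity` placeholder values are hypothetical, and the paper's actual exporter may differ substantially (for example, by attaching the reconstructed meshes to each link).

```python
import xml.etree.ElementTree as ET

def _fmt(vec):
    """Format a 3-vector as a space-separated URDF attribute string."""
    return " ".join(f"{c:g}" for c in vec)

def joint_to_urdf(name, joint_type, axis, origin, lower, upper):
    """Serialize one estimated joint as a minimal URDF string (illustrative only)."""
    if joint_type not in ("revolute", "prismatic"):
        raise ValueError("only revolute and prismatic joints are handled here")
    robot = ET.Element("robot", name=name)
    # Hypothetical link names: the static scene is the parent, the moving part the child.
    ET.SubElement(robot, "link", name="static_scene")
    ET.SubElement(robot, "link", name="moving_part")
    joint = ET.SubElement(robot, "joint", name=f"{name}_joint", type=joint_type)
    ET.SubElement(joint, "parent", link="static_scene")
    ET.SubElement(joint, "child", link="moving_part")
    ET.SubElement(joint, "origin", xyz=_fmt(origin), rpy="0 0 0")
    ET.SubElement(joint, "axis", xyz=_fmt(axis))
    # Limits are in radians (revolute) or metres (prismatic);
    # effort and velocity are placeholder values, not estimates.
    ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                  effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")

urdf_xml = joint_to_urdf("cabinet_door", "revolute",
                         axis=(0, 0, 1), origin=(0.4, 0.0, 0.9),
                         lower=0.0, upper=1.57)
print(urdf_xml)
```

In this sketch the joint limits would come from the range of per-frame joint states observed in the video, which is an assumption about how such an exporter could be wired up rather than a description of the paper's pipeline.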