FuguReport

FunREC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

Authors Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath
Affiliations ETH Zurich / Max Planck Institute for Informatics / Stanford University / USI Lugano / Microsoft
Categories Method / 3D Reconstruction / Reconstructing functional indoor scenes, Application / Robotic Interaction / Robot-scene interaction simulation, Evaluation / Simulation Export / URDF/USD format export and mapping
License CC BY 4.0

Abstract Overview

FunREC is a training-free, optimization-based system that reconstructs functional 3D digital twins of indoor scenes from a single egocentric RGB-D interaction video. The method segments the video into static and dynamic fragments, discovers articulated parts, estimates their kinematic parameters (revolute or prismatic joints) and per-frame poses, and reconstructs both the static geometry and the moving parts in canonical space using TSDF fusion. It combines geometric reasoning with semantic and motion priors from foundation models: a video-language model for interaction detection, point trackers for sparse 3D trajectories, and SAM2 for dense part segmentation. The paper also introduces two new egocentric interaction datasets for evaluating functional scene reconstruction: RealFun4D (351 real interaction videos across 60 apartments) and OmniFun4D (127 photorealistic simulated sequences in 12 OmniGibson scenes).
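To make the two joint types concrete, the following is a minimal sketch of how a revolute or prismatic joint moves a point of a part from canonical space to its pose at joint state q. It is an illustration only, not the authors' code: the function name `articulate`, its argument layout, and the example hinge and drawer values are assumptions introduced here.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    if n == 0.0:
        raise ValueError("joint axis must be non-zero")
    return tuple(x / n for x in v)

def articulate(point, joint_type, axis, origin, q):
    """Move a canonical-space point according to a joint state q.

    Hypothetical helper for illustration:
      revolute  -> rotate by q radians about the line through `origin` along `axis`
      prismatic -> translate by q (scene units) along `axis`; `origin` is unused
    """
    a = _normalize(axis)
    if joint_type == "prismatic":
        return tuple(p + q * ai for p, ai in zip(point, a))
    if joint_type == "revolute":
        # Rodrigues' rotation formula applied to the vector from the axis origin.
        v = tuple(p - o for p, o in zip(point, origin))
        c, s = math.cos(q), math.sin(q)
        axv = _cross(a, v)
        adv = _dot(a, v)
        rotated = tuple(v[i] * c + axv[i] * s + a[i] * adv * (1.0 - c)
                        for i in range(3))
        return tuple(r + o for r, o in zip(rotated, origin))
    raise ValueError(f"unsupported joint type: {joint_type}")

# Example: a door edge 1 unit from a vertical hinge at x = 1, opened by 90 degrees.
door_edge = articulate((2.0, 0.0, 0.0), "revolute", (0, 0, 1), (1.0, 0.0, 0.0), math.pi / 2)
# Example: a drawer front pulled out 0.3 units along +y.
drawer_front = articulate((0.0, 0.0, 0.0), "prismatic", (0, 1, 0), (0.0, 0.0, 0.0), 0.3)
```

In this parameterization a revolute joint needs both an axis direction and a point on the axis, while a prismatic joint needs only a direction, which is why the two types are usually estimated and evaluated separately.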

Novelty

The primary contribution is scene-scale articulated reconstruction from in-the-wild egocentric interaction video without requiring controlled multi-state capture, CAD priors, pre-scanned object models, or any training. The work also introduces two new benchmarks (RealFun4D and OmniFun4D) specifically designed for evaluating functional 3D scene reconstruction from realistic human-scene interactions.

Results

Across OmniFun4D, HOI4D, and RealFun4D, FunREC achieves the best performance among all compared methods in articulated motion estimation, moving-part segmentation, 6D part pose estimation, and reconstruction quality. Moving-part segmentation reaches an mIoU of 77.9, 76.4, and 74.8 on the three datasets respectively, versus 23.6–26.8 for the next-best method. 6D part pose estimation reaches up to 79.43% ADD-S and 69.85% ADD, more than a twofold improvement over BundleSDF. Reconstruction quality, measured as Chamfer Distance, is 3.2 cm, 0.7 cm, and 6.1 cm respectively. The system is further demonstrated for URDF/USD export into physics simulators, hand-guided affordance mapping, and robot-scene interaction from human demonstrations.
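For readers unfamiliar with the metrics named above, here is a small sketch of their commonly used definitions in plain Python. These are textbook forms, given as assumptions: this summary does not state which Chamfer variant (symmetric, squared or unsquared) the paper uses, nor how the ADD and ADD-S distances are turned into the reported percentages.

```python
import math

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

def chamfer_distance(points_p, points_q):
    """Symmetric Chamfer distance (assumed variant): the average of the two
    one-way mean nearest-neighbour distances, P to Q and Q to P."""
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return 0.5 * (one_way(points_p, points_q) + one_way(points_q, points_p))

def add_error(pred_pts, gt_pts):
    """ADD: mean distance between corresponding model points placed with the
    predicted pose and with the ground-truth pose."""
    return sum(math.dist(p, g) for p, g in zip(pred_pts, gt_pts)) / len(gt_pts)

def add_s_error(pred_pts, gt_pts):
    """ADD-S: like ADD, but each ground-truth point is matched to its closest
    predicted point, so symmetric objects are not penalised."""
    return sum(min(math.dist(g, p) for p in pred_pts) for g in gt_pts) / len(gt_pts)

# Toy example: the same two model points, but with correspondences swapped.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
swapped_add = add_error(pred, gt)      # penalises the swap
swapped_add_s = add_s_error(pred, gt)  # ignores the swap
```

The toy example shows why ADD-S scores are typically higher than ADD scores for the same prediction: ADD-S forgives pose errors that map the shape onto itself.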

Key Points

  1. FunREC reconstructs articulated indoor scenes from a single egocentric RGB-D interaction video by jointly estimating camera motion, part motion, articulation parameters (revolute and prismatic joints), and geometry through a training-free optimization pipeline leveraging foundation model priors.
  2. The authors introduce RealFun4D (351 real interaction videos across 60 apartments in four countries) and OmniFun4D (127 photorealistic simulated sequences in 12 OmniGibson scenes), two new datasets for evaluating functional 3D scene reconstruction.
  3. Experiments show large improvements over prior baselines, including a gain of more than 50 mIoU points in part segmentation and 5–10× lower articulation and pose errors. The resulting scene representations are simulation-compatible, as demonstrated through URDF/USD export, affordance mapping, and robotic interaction.
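As an illustration of what a URDF export of one reconstructed part could look like, the sketch below writes a two-link URDF with a single joint using Python's standard XML library. It is not the authors' exporter: the function `make_urdf`, the link and joint names, and the effort and velocity limits are placeholder assumptions, and a simulator-ready file would also need mesh, collision, and inertial elements for each link.

```python
import xml.etree.ElementTree as ET

def _fmt(vec):
    """Format a 3-vector as the space-separated string URDF attributes expect."""
    return " ".join(f"{v:g}" for v in vec)

def make_urdf(name, joint_type, axis, origin, lower, upper):
    """Build a minimal URDF string: a static scene link, one moving part,
    and the joint connecting them (hypothetical helper for illustration)."""
    if joint_type not in ("revolute", "prismatic"):
        raise ValueError(f"unsupported joint type: {joint_type}")
    robot = ET.Element("robot", name=name)
    # Geometry is omitted here; real links would reference reconstructed meshes.
    ET.SubElement(robot, "link", name="static_scene")
    ET.SubElement(robot, "link", name="moving_part")
    joint = ET.SubElement(robot, "joint", name="part_joint", type=joint_type)
    ET.SubElement(joint, "parent", link="static_scene")
    ET.SubElement(joint, "child", link="moving_part")
    ET.SubElement(joint, "origin", xyz=_fmt(origin), rpy="0 0 0")
    ET.SubElement(joint, "axis", xyz=_fmt(axis))
    # URDF requires a <limit> element for revolute and prismatic joints;
    # the effort and velocity values below are arbitrary placeholders.
    ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                  effort="10.0", velocity="1.0")
    return ET.tostring(robot, encoding="unicode")

# Example: a cabinet door hinged about a vertical axis, opening up to ~90 degrees.
cabinet_urdf = make_urdf("cabinet", "revolute", (0, 0, 1), (0.4, 0.0, 0.9), 0.0, 1.57)
```

The joint limits are in radians for a revolute joint and in metres for a prismatic joint, following the URDF convention.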
