EgoLifter: Open-world 3D Segmentation for Egocentric Perception
- URL: http://arxiv.org/abs/2403.18118v2
- Date: Mon, 22 Jul 2024 20:27:01 GMT
- Title: EgoLifter: Open-world 3D Segmentation for Egocentric Perception
- Authors: Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, Chris Sweeney,
- Abstract summary: We present EgoLifter, a system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects.
The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion.
- Score: 10.657841707131416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in ego-centric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We run EgoLifter on various egocentric activity datasets which shows the promise of the method for 3D egocentric perception at scale.
Related papers
- Ego3DT: Tracking Every 3D Object in Ego-centric Videos [20.96550148331019]
This paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video.
We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment.
We have also innovated a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos.
arXiv Detail & Related papers (2024-10-11T05:02:31Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z) - Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textural descriptions, utilize the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z) - ONeRF: Unsupervised 3D Object Segmentation from Multiple Views [59.445957699136564]
ONeRF is a method that automatically segments and reconstructs object instances in 3D from multi-view RGB images without any additional manual annotations.
The segmented 3D objects are represented using separate Neural Radiance Fields (NeRFs) which allow for various 3D scene editing and novel view rendering.
arXiv Detail & Related papers (2022-11-22T06:19:37Z) - D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z) - Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose
Estimation [25.03715978502528]
We propose a method for incorporating object interaction and human body dynamics into the task of 3D ego-pose estimation.
We use a kinematics model of the human body to represent the entire range of human motion, and a dynamics model of the body to interact with objects inside a physics simulator.
This is the first work to estimate a physically valid 3D full-body interaction sequence with objects from egocentric videos.
arXiv Detail & Related papers (2020-11-10T00:06:43Z) - Learning 3D Dynamic Scene Representations for Robot Manipulation [21.6131570689398]
3D scene representation for robot manipulation should capture three key object properties: permanency, completeness, and continuity.
We introduce 3D Dynamic Representation (DSR), a 3D scene representation that simultaneously discovers, tracks, reconstructs objects, and predicts their dynamics.
We propose DSR-Net, which learns to aggregate visual observations over multiple interactions to gradually build and refine DSR.
arXiv Detail & Related papers (2020-11-03T19:23:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.