SIRE: SE(3) Intrinsic Rigidity Embeddings
- URL: http://arxiv.org/abs/2503.07739v1
- Date: Mon, 10 Mar 2025 18:00:30 GMT
- Title: SIRE: SE(3) Intrinsic Rigidity Embeddings
- Authors: Cameron Smith, Basile Van Hoorick, Vitor Guizilini, Yue Wang,
- Abstract summary: We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
- Score: 16.630400019100943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure - highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
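The supervision described above can be made concrete. The following is a minimal NumPy sketch of that loop: lift 2D track points with estimated depth, fit a rigidity-weighted SE(3) transform by weighted Procrustes, reproject, and penalize the 2D discrepancy. The function names, the shared-depth shortcut for the target frame, and the weighting scheme are illustrative assumptions, not the paper's exact solver.

```python
import numpy as np

def unproject(tracks_2d, depth, K):
    """Lift 2D track points (N, 2) to camera-frame 3D using per-point depth (N,)."""
    ones = np.ones((tracks_2d.shape[0], 1))
    rays = np.linalg.inv(K) @ np.concatenate([tracks_2d, ones], axis=1).T
    return (rays * depth).T  # (N, 3)

def fit_se3_weighted(src, dst, w):
    """Weighted least-squares SE(3) fit (Kabsch) mapping src -> dst.
    w holds soft rigidity weights selecting points that move together."""
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst
    cov = (dst - mu_d).T @ ((src - mu_s) * w[:, None])
    U, _, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    return R, mu_d - R @ mu_s

def project(pts, K):
    """Pinhole projection of camera-frame 3D points back to pixels."""
    uvw = (K @ pts.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_loss(tracks_t0, tracks_t1, depth_t0, rigidity_w, K):
    """Lift, fit one rigid motion for the weighted group, reproject, compare."""
    pts0 = unproject(tracks_t0, depth_t0, K)
    # Shortcut (assumption): lift the target frame with the same depth estimate.
    pts1 = unproject(tracks_t1, depth_t0, K)
    R, t = fit_se3_weighted(pts0, pts1, rigidity_w)
    pred_2d = project(pts0 @ R.T + t, K)
    return np.mean(rigidity_w * np.sum((pred_2d - tracks_t1) ** 2, axis=1))
```

In the actual method this solve is end-to-end differentiable, so the loss gradients flow back into the encoder that produces the depth and rigidity estimates.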
Related papers
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World [106.91539872943864]
St4RTrack is a framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs.
We predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry.
We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
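As a toy illustration of the pointmap representation mentioned above: two pixel-aligned pointmaps expressed in one shared world frame reduce per-pixel 3D motion to array indexing. The arrays and shapes here are hypothetical, not St4RTrack's actual outputs.

```python
import numpy as np

H, W = 480, 640
# Hypothetical pixel-aligned pointmaps, both in the same world frame:
# where frame a's pixels are at time a, and where that content is later.
pointmap_a = np.zeros((H, W, 3))
pointmap_b = np.zeros((H, W, 3))

# With a shared world frame and pixel alignment, the 3D motion of the
# surface seen at pixel (v, u) is a direct lookup, no matching needed.
v, u = 100, 200
motion_3d = pointmap_b[v, u] - pointmap_a[v, u]
```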
arXiv Detail & Related papers (2025-04-17T17:55:58Z)
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning. Voxelization infers per-object occupancy probabilities at individual spatial locations. Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases.
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
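To make the motion-basis idea concrete, here is a small sketch where each point's motion is a weighted combination of a few SE(3) bases, blended in the Lie algebra (rotation vector plus translation); the blending scheme and the coefficient head are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def blend_se3_bases(point, coeffs, rotvecs, translations):
    """Move one 3D point by a convex combination of B SE(3) motion bases.
    point: (3,), coeffs: (B,), rotvecs: (B, 3), translations: (B, 3)."""
    rotvec = coeffs @ rotvecs     # blended axis-angle rotation
    t = coeffs @ translations     # blended translation
    R = Rotation.from_rotvec(rotvec).as_matrix()
    return R @ point + t

# The whole motion field is then B basis transforms plus per-point
# coefficients (e.g. a softmax head), not a free displacement per point.
B = 8
rotvecs, translations = np.zeros((B, 3)), np.zeros((B, 3))
coeffs = np.full(B, 1.0 / B)
moved = blend_se3_bases(np.array([0.1, 0.2, 1.0]), coeffs, rotvecs, translations)
```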
arXiv Detail & Related papers (2024-07-18T17:59:08Z)
- Fast Encoder-Based 3D from Casual Videos via Point Track Processing [22.563073026889324]
We present TracksTo4D, a learning-based approach that enables inferring 3D structure and camera positions from dynamic content originating from casual videos.
TracksTo4D is trained in an unsupervised way on a dataset of casual videos.
Experiments show that TracksTo4D can reconstruct a temporal point cloud and camera positions of the underlying video with accuracy comparable to state-of-the-art methods.
arXiv Detail & Related papers (2024-04-10T15:37:00Z)
- Visual Geometry Grounded Deep Structure From Motion [20.203320509695306]
We propose a new deep pipeline VGGSfM, where each component is fully differentiable and can be trained in an end-to-end manner.
First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches.
We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.
arXiv Detail & Related papers (2023-12-07T18:59:52Z)
- Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that purely 2D target representations are limited and instead propose to guide and improve 2D tracking with an explicit 3D object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
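A minimal sketch of the pose half of such an optimization: parameterize the 6DoF pose as an axis-angle rotation plus a translation so it can be refined by gradient-based search. The toy point-alignment objective below stands in for the paper's richer shape/texture/pose loss.

```python
import numpy as np

def exp_so3(rotvec):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pose_objective(pose6, model_pts, observed_pts):
    """Toy stand-in objective over a 6-vector [rotvec | translation]:
    squared distance between the posed model points and observations."""
    R, t = exp_so3(pose6[:3]), pose6[3:]
    return np.sum((model_pts @ R.T + t - observed_pts) ** 2)
```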
arXiv Detail & Related papers (2023-04-13T11:32:36Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
- Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping [23.456046776979903]
We propose to leverage multiview data of static points in arbitrary scenes (static or dynamic) to learn a neural 3D mapping module.
The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output.
We show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers.
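A generic sketch of the RGB-D-to-voxel step described here: unproject per-pixel features along camera rays using depth, then mean-pool them into a fixed-resolution grid. The function name, grid extent, and pooling choice are illustrative, not the paper's exact mapping module.

```python
import numpy as np

def unproject_to_voxels(feat, depth, K, res=32, extent=4.0):
    """Scatter per-pixel features into a (res, res, res, C) voxel grid.
    feat: (H, W, C) deep features, depth: (H, W) metres, K: (3, 3) intrinsics."""
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]   # back-project pixel rays
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Map metric coordinates into voxel indices (grid centred on the camera).
    idx = ((pts / extent + 0.5) * res).astype(int)
    ok = np.all((idx >= 0) & (idx < res), axis=1)
    grid = np.zeros((res, res, res, C))
    count = np.zeros((res, res, res, 1))
    np.add.at(grid, tuple(idx[ok].T), feat.reshape(-1, C)[ok])
    np.add.at(count, tuple(idx[ok].T), 1.0)
    return grid / np.maximum(count, 1.0)  # mean-pooled feature volume
```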
arXiv Detail & Related papers (2020-08-04T02:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.