Learning to Segment Rigid Motions from Two Frames
- URL: http://arxiv.org/abs/2101.03694v1
- Date: Mon, 11 Jan 2021 04:20:30 GMT
- Title: Learning to Segment Rigid Motions from Two Frames
- Authors: Gengshan Yang and Deva Ramanan
- Abstract summary: We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
- Score: 72.14906744113125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Appearance-based detectors achieve remarkable performance on common scenes,
but tend to fail for scenarios lacking training data. Geometric motion
segmentation algorithms, however, generalize to novel scenes, but have yet to
achieve comparable performance to appearance-based ones, due to noisy motion
estimations and degenerate motion configurations. To combine the best of both
worlds, we propose a modular network, whose architecture is motivated by a
geometric analysis of what independent object motions can be recovered from an
egomotion field. It takes two consecutive frames as input and predicts
segmentation masks for the background and multiple rigidly moving objects,
which are then parameterized by 3D rigid transformations. Our method achieves
state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
The inferred rigid motions lead to a significant improvement in depth and
scene flow estimation. At the time of submission, our method ranked 1st on the
KITTI scene flow leaderboard, outperforming the best published method (scene
flow error: 4.89% vs. 6.31%).
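To make the two-frame pipeline concrete, the following is a minimal sketch, not the authors' implementation: a segmentation network maps two consecutive frames to background-plus-object masks, and each segment is then parameterized by an SE(3) transformation, here fitted to stand-in 3D correspondences with a standard least-squares (Kabsch) solver. Class and function names (MotionSegNet, fit_rigid_transform) are placeholders.

```python
# Hypothetical sketch of the two-frame rigid motion segmentation pipeline.
import torch
import torch.nn as nn

class MotionSegNet(nn.Module):
    """Toy stand-in: maps two RGB frames to K+1 soft masks
    (background + K rigidly moving objects)."""
    def __init__(self, num_objects=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_objects + 1, 1),
        )

    def forward(self, frame1, frame2):
        x = torch.cat([frame1, frame2], dim=1)    # (B, 6, H, W)
        return self.backbone(x).softmax(dim=1)    # (B, K+1, H, W) soft masks

def fit_rigid_transform(p_src, p_dst):
    """Least-squares SE(3) fit (Kabsch) between matched 3D points of one segment."""
    c_src, c_dst = p_src.mean(0), p_dst.mean(0)
    H = (p_src - c_src).T @ (p_dst - c_dst)                     # 3x3 cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.linalg.det(Vt.T @ U.T)).item()
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d])) @ U.T    # proper rotation
    t = c_dst - R @ c_src
    return R, t

# Usage sketch: segment, then parameterize each segment by a rigid motion.
frame1, frame2 = torch.rand(1, 3, 64, 96), torch.rand(1, 3, 64, 96)
masks = MotionSegNet()(frame1, frame2)                 # (1, 5, 64, 96)
pts1, pts2 = torch.rand(500, 3), torch.rand(500, 3)    # stand-in 3D correspondences
R, t = fit_rigid_transform(pts1, pts2)                 # one (R, t) per segment in practice
```

In the actual system the rigid fits are guided by the paper's geometric analysis of which motions are recoverable from the egomotion field; the toy network and random correspondences above only illustrate the data flow.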
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases (a brief sketch of this idea follows this entry).
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z)
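As a rough illustration of the SE(3) motion-basis idea in the Shape of Motion entry above (the notation here is mine, not taken from that paper): per-point motion is expressed through a small number K of shared, time-varying rigid bases.

```latex
% Illustrative only: a low-rank motion model with K shared SE(3) bases T_k(t)
% and fixed per-point weights w_{ik}, where K is much smaller than the point count.
\[
  T_i(t) \;=\; \operatorname{blend}\!\Big(\{T_k(t)\}_{k=1}^{K},\, \{w_{ik}\}_{k=1}^{K}\Big),
  \qquad \sum_{k=1}^{K} w_{ik} = 1,\; w_{ik} \ge 0,
\]
\[
  x_i(t) \;=\; T_i(t)\, x_i(0),
\]
% "blend" denotes a weighted combination of rigid transforms in a suitable
% parameterization (e.g. blending rotations and translations separately),
% since SE(3) is not closed under naive linear combination.
```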
- PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects [17.191728053966873]
We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects.
We present PARIS: a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models.
Our method generalizes better across object categories, and outperforms baselines and prior work that are given 3D point clouds as input.
arXiv Detail & Related papers (2023-08-14T18:18:00Z)
- Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation [49.56131393810713]
We present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner.
Our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs.
arXiv Detail & Related papers (2023-06-08T22:55:32Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background (a toy sketch follows this entry).
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
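Purely as an illustration of the motion-grouping idea in the last entry, here is a toy head that groups optical-flow pixels into object/background slots with a single cross-attention layer; it is not that paper's architecture, and all names are placeholders.

```python
# Hypothetical miniature of a Transformer-style motion-grouping head:
# learned slot queries cross-attend to optical-flow features, and per-pixel
# masks come from slot-feature dot products.
import torch
import torch.nn as nn

class FlowGroupingHead(nn.Module):
    def __init__(self, dim=64, num_slots=2):                   # 1 object + 1 background slot
        super().__init__()
        self.encoder = nn.Conv2d(2, dim, 3, padding=1)          # flow (u, v) -> features
        self.queries = nn.Parameter(torch.randn(num_slots, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, flow):                                    # flow: (B, 2, H, W)
        B, _, H, W = flow.shape
        feats = self.encoder(flow).flatten(2).transpose(1, 2)   # (B, HW, dim)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)         # (B, S, dim)
        slots, _ = self.attn(q, feats, feats)                   # (B, S, dim)
        logits = torch.einsum("bsd,bpd->bsp", slots, feats)     # (B, S, HW)
        return logits.softmax(dim=1).view(B, -1, H, W)          # per-pixel soft masks

masks = FlowGroupingHead()(torch.rand(1, 2, 48, 64))            # (1, 2, 48, 64)
```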
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.