EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity
- URL: http://arxiv.org/abs/2309.01296v1
- Date: Mon, 4 Sep 2023 00:30:06 GMT
- Title: EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity
- Authors: Zijie Jiang, Masatoshi Okutomi
- Abstract summary: Self-supervised monocular scene flow estimation has received increasing attention for its simple and economical sensor setup.
We propose a superior model, EMR-MSF, that borrows network architecture design advances from supervised learning.
On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44%.
- Score: 13.02735046166494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular scene flow estimation, aiming to understand both 3D
structures and 3D motions from two temporally consecutive monocular images, has
received increasing attention for its simple and economical sensor setup.
However, the accuracy of current methods is bottlenecked by inefficient
network architectures and the lack of motion-rigidity regularization. In this
paper, we propose a superior model, named EMR-MSF, that borrows network
architecture design advances from supervised learning. We further impose
explicit and robust geometric
constraints with an elaborately constructed ego-motion aggregation module where
a rigidity soft mask is proposed to filter out dynamic regions for stable
ego-motion estimation using static regions. Moreover, we propose a motion
consistency loss along with a mask regularization loss to fully exploit static
regions. Several efficient training strategies are integrated including a
gradient detachment technique and an enhanced view synthesis process for better
performance. Our proposed method outperforms the previous self-supervised works
by a large margin and catches up to the performance of supervised methods. On
the KITTI scene flow benchmark, our approach improves the SF-all metric of the
state-of-the-art self-supervised monocular method by 44% and demonstrates
superior performance across sub-tasks including depth and visual odometry,
amongst other self-supervised single-task or multi-task methods.
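No code accompanies this digest; purely as an illustration, here is a minimal PyTorch-style sketch of how a rigidity soft mask, a motion consistency loss, and a mask regularization loss of the kind the abstract names could be realized. All function names, formulas, and weights below are assumptions, not the authors' implementation.
```python
import torch

def rigidity_soft_mask(sf_full, sf_rigid, alpha=10.0):
    # Per-pixel residual between the full scene flow and the flow induced
    # by ego-motion alone; a small residual suggests a static pixel, so the
    # mask approaches 1 there and 0 on dynamic objects.
    residual = (sf_full - sf_rigid).norm(dim=1, keepdim=True)  # (B, 1, H, W)
    return torch.exp(-alpha * residual)

def motion_losses(sf_full, sf_rigid, mask, beta=0.1):
    # Motion consistency: inside static (high-mask) regions, the full scene
    # flow should agree with the rigid ego-motion flow. Detaching sf_rigid
    # is one plausible reading of the paper's gradient detachment technique.
    consistency = (mask * (sf_full - sf_rigid.detach()).abs()).mean()
    # Mask regularization: discourage the trivial all-zero mask.
    regularization = -beta * torch.log(mask.clamp(min=1e-6)).mean()
    return consistency + regularization
```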
Related papers
- Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise.
One critical aspect is formulating a consistency constraint across the temporal-spatial illumination and appearance of the enhanced versions.
We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision; the Retinex idea itself is sketched after this entry.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
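For readers unfamiliar with Retinex decomposition, a toy PyTorch sketch of the underlying I = L * R idea (the paper learns this decomposition; the channel-max and box-filter illumination estimate here is purely an illustrative stand-in):
```python
import torch
import torch.nn.functional as F

def retinex_decompose(img, eps=1e-4):
    # Classic Retinex intuition: I = L * R, with L a smooth illumination
    # map and R the reflectance. A crude L estimate: channel-wise max
    # followed by a box filter.
    illum = img.max(dim=1, keepdim=True).values              # (B, 1, H, W)
    illum = F.avg_pool2d(illum, kernel_size=9, stride=1, padding=4)
    reflect = img / (illum + eps)                            # R = I / L
    return illum, reflect
```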
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel semantic scene completion (SSC) framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed; a sketch of the memory-bank interaction follows this entry.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
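The memory-bank interaction above can be pictured as cross-attention from current-frame features to stored historical features; a minimal, hypothetical sketch (module name, dimensions, and FIFO policy are assumptions, not the paper's design):
```python
import torch
import torch.nn as nn

class MemoryBankAttention(nn.Module):
    # Current-frame features attend to multi-frame features kept in a
    # fixed-size FIFO memory bank, so only one frame is processed per step.
    def __init__(self, dim=256, capacity=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.capacity = capacity
        self.bank = []  # list of (B, N, dim) feature tensors

    def forward(self, feat):                      # feat: (B, N, dim)
        if self.bank:
            memory = torch.cat(self.bank, dim=1)  # (B, T*N, dim)
            feat, _ = self.attn(feat, memory, memory)
        self.bank.append(feat.detach())           # store without gradients
        self.bank = self.bank[-self.capacity:]    # FIFO eviction
        return feat
```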
- FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation [17.572459787107427]
We propose a flow distillation loss to replace the typical photometric loss, and a prior-flow-based mask to remove invalid pixels; a sketch of such a loss follows this entry.
Our approach achieves state-of-the-art results on both KITTI and NYU-Depth-v2 datasets.
arXiv Detail & Related papers (2023-01-20T04:02:13Z)
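A flow distillation loss of this kind supervises the flow implied by predicted depth and pose with the output of a frozen teacher flow network; a hedged sketch (the names and the masking rule are assumptions):
```python
import torch

def flow_distillation_loss(rigid_flow, teacher_flow, valid_mask):
    # rigid_flow:   2D flow induced by the predicted depth and ego-motion.
    # teacher_flow: optical flow from a frozen, pretrained flow network.
    # valid_mask:   prior-flow-based mask (1 = reliable pixel).
    diff = (rigid_flow - teacher_flow.detach()).abs().sum(dim=1, keepdim=True)
    return (valid_mask * diff).sum() / valid_mask.sum().clamp(min=1.0)
```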
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which jointly predicts scene depth and a 3D motion field.
Our contributions are twofold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain an enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors; a feature-warping sketch follows this entry.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
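Multi-view correlation presupposes aligning source-view features to the reference view. A minimal sketch of flow-based feature warping with grid_sample (a simplification: in the paper the correspondence would come from depth and camera pose, and all names here are assumptions):
```python
import torch
import torch.nn.functional as F

def warp_to_reference(src_feat, flow):
    # Bilinearly sample source-view features at positions displaced by a
    # 2D flow field so they align with the reference view; self- and
    # cross-attention layers can then correlate the aligned features.
    b, _, h, w = src_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    x = 2.0 * (xs.to(src_feat) + flow[:, 0]) / (w - 1) - 1.0  # to [-1, 1]
    y = 2.0 * (ys.to(src_feat) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)                        # (B, H, W, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)
```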
- MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints [70.76761166614511]
We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO).
Our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems.
arXiv Detail & Related papers (2021-09-14T15:35:08Z)
- Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
- Unsupervised Motion Representation Enhanced Network for Action Recognition [4.42249337449125]
Motion representation between consecutive frames has proven to greatly benefit video understanding.
The TV-L1 method, an effective optical flow solver, is time-consuming, and caching the extracted optical flow is expensive in storage.
We propose UF-TSN, a novel end-to-end action recognition approach enhanced with an embedded lightweight unsupervised optical flow estimator; such an unsupervised flow loss is sketched after this entry.
arXiv Detail & Related papers (2021-03-05T04:14:32Z)
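An embedded unsupervised flow estimator is typically trained with a photometric warping loss plus a smoothness term, avoiding cached TV-L1 flow entirely; a hedged sketch (loss weights and helper names are assumptions):
```python
import torch
import torch.nn.functional as F

def backwarp(img, flow):
    # Sample img at positions displaced by flow (bilinear backward warp).
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    x = 2.0 * (xs.to(img) + flow[:, 0]) / (w - 1) - 1.0
    y = 2.0 * (ys.to(img) + flow[:, 1]) / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((x, y), dim=-1), align_corners=True)

def unsupervised_flow_loss(frame1, frame2, flow, smooth_w=0.1):
    # Brightness constancy: frame2 warped by the flow should match frame1,
    # so no ground-truth flow (and no cached TV-L1 output) is required.
    photometric = (frame1 - backwarp(frame2, flow)).abs().mean()
    # First-order smoothness on the flow field (horizontal + vertical).
    smooth = (flow[..., 1:] - flow[..., :-1]).abs().mean() + \
             (flow[:, :, 1:] - flow[:, :, :-1]).abs().mean()
    return photometric + smooth_w * smooth
```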
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution at each gate of a ConvLSTM with a depthwise separable convolution (sketched after this entry).
Our model outperforms the previous state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
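The gate replacement above rests on the standard depthwise separable convolution; a small PyTorch sketch of the building block, which uses roughly 1/C_out + 1/k^2 of a full convolution's parameters:
```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise conv (one filter per input channel) followed by a 1x1
    # pointwise conv. SepConvLSTM swaps this block in for the full
    # convolution at each ConvLSTM gate (input, forget, output, cell).
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```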
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of which independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations (sketched after this entry).
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
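Parameterizing each segmented region by a 3D rigid transformation amounts to back-projecting pixels, applying a per-object SE(3) transform, and re-projecting; an illustrative sketch (intrinsics handling and all names are assumptions, not the paper's code):
```python
import torch

def rigid_flow_for_object(depth, K, R, t, mask):
    # Back-project pixels to 3D with depth and intrinsics K, move them by
    # the object's rigid transform (R, t), re-project, and keep the induced
    # 2D flow only inside the object's segmentation mask.
    h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).float()  # (3,H,W)
    pts = torch.linalg.inv(K) @ pix.reshape(3, -1) * depth.reshape(1, -1)
    moved = R @ pts + t.reshape(3, 1)
    proj = K @ moved
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    flow = (uv - pix[:2].reshape(2, -1)).reshape(2, h, w)
    return flow * mask  # mask: (H, W), 1 inside the rigid object
```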