DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in
Dynamic Environments
- URL: http://arxiv.org/abs/2209.08430v4
- Date: Sat, 29 Apr 2023 04:37:57 GMT
- Title: DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in
Dynamic Environments
- Authors: Shihao Shen and Yilin Cai and Wenshan Wang and Sebastian Scherer
- Abstract summary: We present DytanVO, the first supervised learning-based VO method that deals with dynamic environments.
Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments.
- Score: 6.5121327691369615
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Learning-based visual odometry (VO) algorithms achieve remarkable performance
on common static scenes, benefiting from high-capacity models and massive
annotated data, but tend to fail in dynamic, populated environments. Semantic
segmentation is widely used to discard dynamic associations before estimating
camera motion, but this comes at the cost of discarding static features as well
and is hard to scale to unseen categories. In this paper, we leverage the
mutual dependence
between camera ego-motion and motion segmentation and show that both can be
jointly refined in a single learning-based framework. In particular, we present
DytanVO, the first supervised learning-based VO method that deals with dynamic
environments. It takes two consecutive monocular frames in real time and
predicts camera ego-motion in an iterative fashion. Our method achieves an
average improvement of 27.7% in ATE over state-of-the-art VO solutions in
real-world dynamic environments, and even performs competitively among dynamic
visual SLAM systems which optimize the trajectory on the backend. Experiments
on plentiful unseen environments also demonstrate our method's
generalizability.
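To make the iterative scheme concrete, here is a minimal sketch of the joint-refinement loop described in the abstract, assuming hypothetical flow_net, pose_net, and seg_net callables for the learned flow, pose, and segmentation modules; it illustrates the idea and is not the authors' released implementation.

    import numpy as np

    def joint_refinement(frame1, frame2, flow_net, pose_net, seg_net,
                         max_iters=3, tol=1e-3):
        """Alternate ego-motion estimation and motion segmentation.

        flow_net, pose_net, seg_net are placeholders for learned modules
        predicting optical flow, camera pose, and a dynamic-pixel mask.
        """
        flow = flow_net(frame1, frame2)              # dense 2D optical flow (H, W, 2)
        mask = np.zeros(flow.shape[:2], dtype=bool)  # iteration 0: treat all pixels as static
        pose = None
        for _ in range(max_iters):
            # Estimate ego-motion from pixels currently labeled static.
            new_pose = pose_net(flow, static_mask=~mask)
            # Re-segment: pixels whose flow disagrees with the rigid flow
            # implied by the new ego-motion are likely on moving objects.
            mask = seg_net(flow, new_pose)
            # Stop once the pose update becomes negligible.
            if pose is not None and np.linalg.norm(new_pose - pose) < tol:
                break
            pose = new_pose
        return pose, mask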
Related papers
- DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery [11.37707868611451]
We propose a robust VINS framework called DynaVINS++.
Our approach shows promising performance in dynamic environments, including scenes with abruptly dynamic objects.
arXiv Detail & Related papers (2024-10-20T12:13:45Z)
- Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective
We propose "Bi-level Optimization of Learning Dynamic with Decoupling and Intervention" (BOLD-DI) to capture both static and dynamic semantics in a decoupled manner.
Our method can be seamlessly integrated into existing v-CL methods, and experimental results highlight significant improvements.
arXiv Detail & Related papers (2024-07-19T06:53:54Z)
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM [17.661231232206028]
Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention.
We propose a novel SLAM framework for dynamic environments.
arXiv Detail & Related papers (2024-07-18T09:35:48Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic-range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks [18.892883695539002]
HyperDynamics is a dynamics meta-learning framework that generates parameters of neural dynamics models.
It outperforms existing models that adapt to environment variations by learning dynamics over high-dimensional visual observations.
We show our method matches the performance of an ensemble of separately trained experts, while also being able to generalize well to unseen environment variations at test time.
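As a rough illustration of the hypernetwork idea (not the HyperDynamics architecture itself), the sketch below uses an encoder that maps an environment observation to the weights of a small (state, action) -> next-state dynamics model; all dimensions and names are illustrative.

    import torch
    import torch.nn as nn

    class HyperDynamicsSketch(nn.Module):
        """Toy hypernetwork: an encoder embeds an environment observation
        and decodes it into the weights of a small dynamics MLP that maps
        (state, action) -> next state. Dimensions are illustrative."""

        def __init__(self, obs_dim=32, state_dim=8, action_dim=2, hidden=64):
            super().__init__()
            self.in_dim, self.out_dim, self.h = state_dim + action_dim, state_dim, hidden
            n_weights = (self.in_dim * hidden + hidden
                         + hidden * self.out_dim + self.out_dim)
            self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                         nn.Linear(128, n_weights))

        def forward(self, obs, state, action):
            w = self.encoder(obs)  # generated dynamics-model parameters
            i, h = 0, self.h
            W1 = w[i:i + self.in_dim * h].view(h, self.in_dim)
            i += self.in_dim * h
            b1 = w[i:i + h]
            i += h
            W2 = w[i:i + h * self.out_dim].view(self.out_dim, h)
            i += h * self.out_dim
            b2 = w[i:i + self.out_dim]
            x = torch.cat([state, action], dim=-1)
            return torch.relu(x @ W1.T + b1) @ W2.T + b2  # predicted next state

    # e.g. HyperDynamicsSketch()(torch.randn(32), torch.randn(8), torch.randn(2))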
arXiv Detail & Related papers (2021-03-17T04:48:43Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
- ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings [54.33327082243022]
ClusterVO is a stereo visual odometry system that simultaneously clusters and estimates the motion of both the ego-camera and surrounding rigid clusters/objects.
Unlike previous solutions that rely on batch input or impose priors on scene structure or dynamic object models, ClusterVO is online and general, and can therefore be used in various scenarios, including indoor scene understanding and autonomous driving.
arXiv Detail & Related papers (2020-03-29T09:06:28Z)
- FlowFusion: Dynamic Dense RGB-D SLAM Based on Optical Flow [17.040818114071833]
We present a novel dense RGB-D SLAM solution that simultaneously accomplishes the dynamic/static segmentation and camera ego-motion estimation.
Our novelty is using optical flow residuals to highlight the dynamic semantics in the RGB-D point clouds.
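The flow-residual idea can be sketched as follows, under assumptions: given a depth map, camera intrinsics, and an ego-motion estimate, compute the flow that camera motion alone would induce, and flag pixels whose observed flow deviates from it. All function names and the threshold are illustrative rather than FlowFusion's actual interface.

    import numpy as np

    def rigid_flow(depth, K, R, t):
        """Flow induced purely by camera motion (R, t) for a depth map.

        depth: (H, W) metric depth (assumed valid/nonzero); K: (3, 3) intrinsics.
        """
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
        pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)  # back-project pixels
        proj = K @ (R @ pts + t.reshape(3, 1))               # transform and project
        proj = (proj[:2] / proj[2]).T.reshape(H, W, 2)
        return proj - np.stack([u, v], axis=-1)

    def dynamic_mask(observed_flow, depth, K, R, t, thresh=3.0):
        """Mark pixels whose observed flow deviates from the rigid flow
        by more than thresh pixels as dynamic."""
        residual = observed_flow - rigid_flow(depth, K, R, t)
        return np.linalg.norm(residual, axis=-1) > thresh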
arXiv Detail & Related papers (2020-03-11T04:00:49Z)