Detaching and Boosting: Dual Engine for Scale-Invariant Self-Supervised
Monocular Depth Estimation
- URL: http://arxiv.org/abs/2210.03952v1
- Date: Sat, 8 Oct 2022 07:38:11 GMT
- Title: Detaching and Boosting: Dual Engine for Scale-Invariant Self-Supervised
Monocular Depth Estimation
- Authors: Peizhe Jiang and Wei Yang and Xiaoqing Ye and Xiao Tan and Meng Wu
- Abstract summary: We present a scale-invariant approach for self-supervised MDE in which scale-sensitive features (SSFs) are detached.
To be specific, a simple but effective data augmentation that imitates the camera zooming process is proposed to detach SSFs.
Our approach achieves new state-of-the-art performance, improving the absolute relative error over existing works from 0.097 to 0.090.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular depth estimation (MDE) in the self-supervised setting has
emerged as a promising approach because it removes the need for ground-truth
depth. Despite continuous efforts, MDE remains sensitive to scale changes,
especially when all training samples come from a single camera, and it
deteriorates further because camera movement heavily couples the predicted
depth with the scale change. In this paper, we present a scale-invariant
approach for self-supervised MDE in which scale-sensitive features (SSFs) are
detached while scale-invariant features (SIFs) are boosted. Specifically, a
simple but effective data augmentation that imitates the camera zooming
process is proposed to detach SSFs, making the model robust to scale changes.
In addition, a dynamic cross-attention module is designed to boost SIFs by
adaptively fusing multi-scale cross-attention features. Extensive experiments
on the KITTI dataset demonstrate that the detaching and boosting strategies
are mutually complementary in MDE, and our approach achieves new
state-of-the-art performance, improving the absolute relative error over
existing works from 0.097 to 0.090. The code will be made public soon.
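For reference, the absolute relative error above is the standard KITTI metric AbsRel = (1/N) * sum_i |d_i - d_i^*| / d_i^*, where d_i is the predicted depth and d_i^* the ground truth.

The abstract only names the two components, so the sketches below are hedged illustrations, not the authors' code. First, a minimal PyTorch sketch of a zoom-imitating augmentation: a zoom by factor s is simulated as a centered crop of 1/s of the field of view resized back to the original resolution, with the camera intrinsics rescaled to match. The function name, zoom range, and tensor layout are assumptions.

    import torch
    import torch.nn.functional as F

    def zoom_augment(images, intrinsics, zoom_range=(1.0, 1.6)):
        # images:     (B, N, 3, H, W) snippet of N consecutive frames
        # intrinsics: (B, 3, 3) pinhole camera matrix per snippet
        B, N, C, H, W = images.shape
        zoom = torch.empty(B).uniform_(*zoom_range)  # one factor per snippet
        out, K = images.clone(), intrinsics.clone()
        for b in range(B):
            s = zoom[b].item()
            ch, cw = max(1, round(H / s)), max(1, round(W / s))
            top, left = (H - ch) // 2, (W - cw) // 2
            crop = images[b, :, :, top:top + ch, left:left + cw]
            out[b] = F.interpolate(crop, size=(H, W), mode="bilinear",
                                   align_corners=False)
            sx, sy = W / cw, H / ch  # effective per-axis zoom after rounding
            K[b, 0, 0] *= sx                       # fx
            K[b, 1, 1] *= sy                       # fy
            K[b, 0, 2] = (K[b, 0, 2] - left) * sx  # cx follows the crop
            K[b, 1, 2] = (K[b, 1, 2] - top) * sy   # cy follows the crop
        return out, K

Applying the same zoom to every frame of a snippet while rescaling K keeps the photometric reprojection geometrically consistent yet changes the apparent object size at a fixed true depth, which is exactly the scale sensitivity the paper targets.

Second, one plausible reading of a dynamic cross-attention module that "fuses multi-scale cross-attention features adaptively": each scale serves as keys/values for queries from the finest scale, and per-scale fusion weights are predicted from the attended features so the mixing adapts to the input. All names and hyper-parameters here are hypothetical.

    import torch
    import torch.nn as nn

    class DynamicCrossAttentionFusion(nn.Module):
        def __init__(self, channels, num_heads=4):
            # channels must be divisible by num_heads
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, num_heads,
                                              batch_first=True)
            self.gate = nn.Linear(channels, 1)  # scalar weight per scale

        def forward(self, feats):
            # feats: list of (B, C, Hi, Wi) maps; feats[-1] is the finest
            B, C, H, W = feats[-1].shape
            query = feats[-1].flatten(2).transpose(1, 2)      # (B, H*W, C)
            attended, gates = [], []
            for f in feats:
                kv = f.flatten(2).transpose(1, 2)             # (B, Hi*Wi, C)
                att, _ = self.attn(query, kv, kv)             # cross-attention
                attended.append(att)
                gates.append(self.gate(att.mean(dim=1)))      # (B, 1)
            w = torch.softmax(torch.cat(gates, dim=1), dim=1) # (B, S) dynamic
            fused = sum(w[:, i, None, None] * attended[i]
                        for i in range(len(attended)))
            return fused.transpose(1, 2).reshape(B, C, H, W)

How such a module would be wired into the depth decoder is left open here, since the abstract does not say.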
Related papers
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Gradient-Guided Modality Decoupling for Missing-Modality Robustness [24.95911972867697]
We introduce gradients as a novel indicator to monitor and reduce modality dominance.
We present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities.
In addition, to flexibly handle modality-incomplete data, we design a parameter-efficient Dynamic Sharing framework.
arXiv Detail & Related papers (2024-02-26T05:50:43Z) - Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - DeepMLE: A Robust Deep Maximum Likelihood Estimator for Two-view
Structure from Motion [9.294501649791016]
Two-view structure from motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM (vSLAM).
We formulate the two-view SfM problem as a maximum likelihood estimation (MLE) and solve it with the proposed framework, denoted as DeepMLE.
Our method significantly outperforms the state-of-the-art end-to-end two-view SfM approaches in accuracy and generalization capability.
arXiv Detail & Related papers (2022-10-11T15:07:25Z) - Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular
Depth Estimation by Integrating IMU Motion Dynamics [74.1720528573331]
Unsupervised monocular depth and ego-motion estimation has drawn extensive research attention in recent years.
We propose DynaDepth, a novel scale-aware framework that integrates information from vision and IMU motion dynamics.
We validate the effectiveness of DynaDepth by conducting extensive experiments and simulations on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2022-07-11T07:50:22Z) - Regularity Learning via Explicit Distribution Modeling for Skeletal
Video Anomaly Detection [43.004613173363566]
A novel Motion Embedder (ME) is proposed to provide a pose motion representation from a probabilistic perspective.
A novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction.
MoPRL achieves state-of-the-art performance, with an average improvement of 4.7% AUC on several challenging datasets.
arXiv Detail & Related papers (2021-12-07T11:52:25Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z) - VMLoc: Variational Fusion For Learning-Based Multimodal Camera
Localization [46.607930208613574]
We propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space.
Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated.
arXiv Detail & Related papers (2020-03-12T14:52:10Z)