Detaching and Boosting: Dual Engine for Scale-Invariant Self-Supervised
Monocular Depth Estimation
- URL: http://arxiv.org/abs/2210.03952v1
- Date: Sat, 8 Oct 2022 07:38:11 GMT
- Title: Detaching and Boosting: Dual Engine for Scale-Invariant Self-Supervised
Monocular Depth Estimation
- Authors: Peizhe Jiang and Wei Yang and Xiaoqing Ye and Xiao Tan and Meng Wu
- Abstract summary: We present a scale-invariant approach for self-supervised MDE in which scale-sensitive features (SSFs) are detached.
To be specific, a simple but effective data augmentation that imitates the camera zooming process is proposed to detach SSFs.
Our approach achieves new state-of-the-art performance, improving the absolute relative error over existing works from 0.097 to 0.090.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular depth estimation (MDE) in the self-supervised setting has
emerged as a promising approach because it removes the need for ground-truth
depth. Despite continuous efforts, MDE remains sensitive to scale changes,
especially when all training samples come from a single camera, and it
deteriorates further because camera movement heavily couples the predicted
depth with the scale change. In this paper, we present a scale-invariant
approach for self-supervised MDE in which scale-sensitive features (SSFs) are
detached while scale-invariant features (SIFs) are boosted. Specifically, a
simple but effective data augmentation that imitates the camera zooming
process is proposed to detach SSFs, making the model robust to scale changes.
In addition, a dynamic cross-attention module is designed to boost SIFs by
adaptively fusing multi-scale cross-attention features. Extensive experiments
on the KITTI dataset demonstrate that the detaching and boosting strategies
are mutually complementary in MDE, and our approach achieves new
state-of-the-art performance, improving the absolute relative error over
existing works from 0.097 to 0.090. The code will be made public soon.
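For reference, the absolute relative error above is the standard KITTI metric AbsRel = (1/N) * sum_i |d_i - d_i^*| / d_i^*, where d_i is the predicted depth and d_i^* the ground truth.

The abstract only names the two components, so the sketches below are hedged illustrations, not the authors' code. First, a minimal PyTorch sketch of a zoom-imitating augmentation: a zoom by factor s is simulated as a centered crop of 1/s of the field of view resized back to the original resolution, with the camera intrinsics rescaled to match. The function name, zoom range, and tensor layout are assumptions.

    import torch
    import torch.nn.functional as F

    def zoom_augment(images, intrinsics, zoom_range=(1.0, 1.6)):
        # images:     (B, N, 3, H, W) snippet of N consecutive frames
        # intrinsics: (B, 3, 3) pinhole camera matrix per snippet
        B, N, C, H, W = images.shape
        zoom = torch.empty(B).uniform_(*zoom_range)  # one factor per snippet
        out, K = images.clone(), intrinsics.clone()
        for b in range(B):
            s = zoom[b].item()
            ch, cw = max(1, round(H / s)), max(1, round(W / s))
            top, left = (H - ch) // 2, (W - cw) // 2
            crop = images[b, :, :, top:top + ch, left:left + cw]
            out[b] = F.interpolate(crop, size=(H, W), mode="bilinear",
                                   align_corners=False)
            sx, sy = W / cw, H / ch  # effective per-axis zoom after rounding
            K[b, 0, 0] *= sx                       # fx
            K[b, 1, 1] *= sy                       # fy
            K[b, 0, 2] = (K[b, 0, 2] - left) * sx  # cx follows the crop
            K[b, 1, 2] = (K[b, 1, 2] - top) * sy   # cy follows the crop
        return out, K

Applying the same zoom to every frame of a snippet while rescaling K keeps the photometric reprojection geometrically consistent yet changes the apparent object size at a fixed true depth, which is exactly the scale sensitivity the paper targets.

Second, one plausible reading of a dynamic cross-attention module that "fuses multi-scale cross-attention features adaptively": each scale serves as keys/values for queries from the finest scale, and per-scale fusion weights are predicted from the attended features so the mixing adapts to the input. All names and hyper-parameters here are hypothetical.

    import torch
    import torch.nn as nn

    class DynamicCrossAttentionFusion(nn.Module):
        def __init__(self, channels, num_heads=4):
            # channels must be divisible by num_heads
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, num_heads,
                                              batch_first=True)
            self.gate = nn.Linear(channels, 1)  # scalar weight per scale

        def forward(self, feats):
            # feats: list of (B, C, Hi, Wi) maps; feats[-1] is the finest
            B, C, H, W = feats[-1].shape
            query = feats[-1].flatten(2).transpose(1, 2)      # (B, H*W, C)
            attended, gates = [], []
            for f in feats:
                kv = f.flatten(2).transpose(1, 2)             # (B, Hi*Wi, C)
                att, _ = self.attn(query, kv, kv)             # cross-attention
                attended.append(att)
                gates.append(self.gate(att.mean(dim=1)))      # (B, 1)
            w = torch.softmax(torch.cat(gates, dim=1), dim=1) # (B, S) dynamic
            fused = sum(w[:, i, None, None] * attended[i]
                        for i in range(len(attended)))
            return fused.transpose(1, 2).reshape(B, C, H, W)

How such a module would be wired into the depth decoder is left open here, since the abstract does not say.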
Related papers
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Gradient-Guided Modality Decoupling for Missing-Modality Robustness [24.95911972867697]
We introduce gradients as a novel indicator to monitor and reduce modality dominance.
We present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities.
In addition, to flexibly handle modality-incomplete data, we design a parameter-efficient Dynamic Sharing framework.
arXiv Detail & Related papers (2024-02-26T05:50:43Z) - Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - DeepMLE: A Robust Deep Maximum Likelihood Estimator for Two-view
Structure from Motion [9.294501649791016]
Two-view structure from motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM (vSLAM).
We formulate the two-view SfM problem as a maximum likelihood estimation (MLE) and solve it with the proposed framework, denoted as DeepMLE.
Our method significantly outperforms the state-of-the-art end-to-end two-view SfM approaches in accuracy and generalization capability.
arXiv Detail & Related papers (2022-10-11T15:07:25Z) - Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular
Depth Estimation by Integrating IMU Motion Dynamics [74.1720528573331]
Unsupervised monocular depth and ego-motion estimation has drawn extensive research attention in recent years.
We propose DynaDepth, a novel scale-aware framework that integrates information from vision and IMU motion dynamics.
We validate the effectiveness of DynaDepth by conducting extensive experiments and simulations on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2022-07-11T07:50:22Z) - Regularity Learning via Explicit Distribution Modeling for Skeletal
Video Anomaly Detection [43.004613173363566]
A novel Motion Embedder (ME) is proposed to provide a pose motion representation from a probabilistic perspective.
A novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction.
MoPRL achieves state-of-the-art performance, with an average improvement of 4.7% AUC on several challenging datasets.
arXiv Detail & Related papers (2021-12-07T11:52:25Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z) - VMLoc: Variational Fusion For Learning-Based Multimodal Camera
Localization [46.607930208613574]
We propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space.
Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated.
arXiv Detail & Related papers (2020-03-12T14:52:10Z)