PPEA-Depth: Progressive Parameter-Efficient Adaptation for
Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2312.13066v2
- Date: Wed, 17 Jan 2024 00:39:17 GMT
- Title: PPEA-Depth: Progressive Parameter-Efficient Adaptation for
Self-Supervised Monocular Depth Estimation
- Authors: Yue-Jiang Dong, Yuan-Chen Guo, Ying-Tian Liu, Fang-Lue Zhang, Song-Hai
Zhang
- Abstract summary: We propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation.
The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, succeeded by an expansion to more intricate datasets.
Experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI, CityScapes and DDAD datasets.
- Score: 24.68378829544394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation is of significant importance with
applications spanning across autonomous driving and robotics. However, the
reliance on self-supervision introduces a strong static-scene assumption,
thereby posing challenges in achieving optimal performance in dynamic scenes,
which are prevalent in most real-world situations. To address these issues, we
propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to
transfer a pre-trained image model for self-supervised depth estimation. The
training comprises two sequential stages: an initial phase trained on a dataset
primarily composed of static scenes, succeeded by an expansion to more
intricate datasets involving dynamic scenes. To facilitate this process, we
design compact encoder and decoder adapters to enable parameter-efficient
tuning, allowing the network to adapt effectively. The adapters not only
preserve the generalized patterns of the pre-trained image model but also
carry knowledge gained in the preceding phase into the subsequent one.
Extensive experiments
demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI,
CityScapes and DDAD datasets.
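The abstract describes the core mechanism: a frozen pre-trained backbone augmented with small trainable adapters, tuned progressively over two stages (static scenes first, then more dynamic data). Below is a minimal PyTorch sketch of this idea using a residual bottleneck adapter; the adapter design, dimensions, and the AdaptedBlock wrapper are illustrative assumptions, not the paper's exact encoder/decoder adapter architecture.

```python
# Minimal sketch of parameter-efficient adapter tuning around a frozen encoder block.
# The residual bottleneck design below is a common adapter pattern and is assumed here;
# PPEA-Depth's exact adapter layout is not specified by the abstract.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual adapter; only these weights are updated during tuning."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapted block starts as the identity,
        # preserving the pre-trained model's behaviour at the start of each stage.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Frozen pre-trained block followed by a trainable adapter."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone stays frozen; only adapters train
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


# Stage 1 would train the adapters on a mostly static-scene dataset; stage 2 continues
# from those adapter weights on more dynamic data, so knowledge from the earlier phase
# is carried into the later one.
block = AdaptedBlock(nn.Linear(256, 256), dim=256)  # stand-in for a pre-trained encoder block
optimizer = torch.optim.AdamW(
    [p for p in block.parameters() if p.requires_grad], lr=1e-4
)
```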
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - DepthART: Monocular Depth Estimation as Autoregressive Refinement Task [2.3884184860468136]
We introduce the first autoregressive depth estimation model based on the visual autoregressive transformer.
Our primary contribution is DepthART, a novel training method formulated as Depth Autoregressive Refinement Task.
Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction in the depth estimation task.
arXiv Detail & Related papers (2024-09-23T13:36:34Z) - Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of the VLM's zero-shot generalization; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method titled DualAD outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual
Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which jointly predicts scene depth and the 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z) - Forecasting of depth and ego-motion with transformers and self-supervision [0.0]
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules.
arXiv Detail & Related papers (2022-06-15T10:14:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.