PPEA-Depth: Progressive Parameter-Efficient Adaptation for
Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2312.13066v2
- Date: Wed, 17 Jan 2024 00:39:17 GMT
- Title: PPEA-Depth: Progressive Parameter-Efficient Adaptation for
Self-Supervised Monocular Depth Estimation
- Authors: Yue-Jiang Dong, Yuan-Chen Guo, Ying-Tian Liu, Fang-Lue Zhang, Song-Hai
Zhang
- Abstract summary: We propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation.
The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, followed by an expansion to more intricate datasets.
Experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI, CityScapes and DDAD datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation is of significant importance with
applications spanning across autonomous driving and robotics. However, the
reliance on self-supervision introduces a strong static-scene assumption,
thereby posing challenges in achieving optimal performance in dynamic scenes,
which are prevalent in most real-world situations. To address these issues, we
propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to
transfer a pre-trained image model for self-supervised depth estimation. The
training comprises two sequential stages: an initial phase trained on a dataset
primarily composed of static scenes, followed by an expansion to more intricate
datasets involving dynamic scenes. To facilitate this process, we design
compact encoder and decoder adapters that enable parameter-efficient tuning and
allow the network to adapt effectively. These adapters not only preserve the
generalized patterns learned by pre-trained image models but also carry
knowledge gained in the preceding phase forward into the subsequent one.
Extensive experiments
demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI,
CityScapes and DDAD datasets.
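The digest carries no code, but the general recipe the abstract describes, freezing a pre-trained backbone and training small adapters over two stages, can be sketched in PyTorch. Everything below (class names, the bottleneck design, the reduction factor) is an illustrative assumption, not the authors' implementation:
```python
# Minimal PyTorch sketch (illustrative assumptions, not the authors' code):
# a bottleneck adapter wrapped around each frozen encoder block, so only the
# adapters receive gradients during tuning.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen pre-trained block with a trainable adapter."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# Stage 1: train only the adapters on a mostly-static dataset (e.g. KITTI).
# Stage 2: reload the stage-1 adapters and continue on a more dynamic dataset
# (e.g. CityScapes or DDAD), still updating only the lightweight adapters.
```
In this sketch, stage 2 simply resumes from the stage-1 adapter weights; how PPEA-Depth actually couples the two stages is detailed in the paper.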
Related papers
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method, DualAD, outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
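As a rough illustration of trainable cross-modal layers inside frozen encoders, here is a minimal sketch in which audio features gate the channels of visual features; the module name and shapes are assumptions, not the DG-SCT design:
```python
# Illustrative sketch only (not the DG-SCT module): a trainable cross-modal
# layer in which audio features gate the channels of frozen visual features.
import torch
import torch.nn as nn

class AudioGatedVisual(nn.Module):
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim, visual_channels),
                                  nn.Sigmoid())

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor):
        # visual_feat: (B, C, H, W) from a frozen encoder; audio_feat: (B, A)
        g = self.gate(audio_feat)[:, :, None, None]  # (B, C, 1, 1) gates
        return visual_feat * g  # channel-wise cross-modal modulation
```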
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes
We propose a novel Dyna-Depthformer framework that jointly predicts scene depth and a 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain enhanced depth feature representations.
Second, we propose a warping-based Motion Network that estimates the motion field of dynamic objects without using semantic priors.
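A minimal sketch of the cross-attention ingredient, assuming flattened spatial tokens and standard multi-head attention (the layer name and dimensions are illustrative, not the paper's architecture):
```python
# Illustrative sketch: target-frame tokens attend to a neighboring source
# frame to build a multi-view-aware depth feature.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feat: torch.Tensor, source_feat: torch.Tensor):
        # Both inputs: (B, N, C) flattened spatial tokens.
        fused, _ = self.attn(query=target_feat, key=source_feat,
                             value=source_feat)
        return self.norm(target_feat + fused)  # residual multi-view fusion
```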
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes
Self-supervised monocular depth estimation has shown impressive results in static scenes.
However, it relies on the multi-view consistency assumption for training, which is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model to generate a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
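One plausible way to turn such a single-image prior into a training signal is a scale-aligned consistency loss applied where multi-view cues fail; the sketch below is an assumption about the mechanism, not the SC-DepthV3 code:
```python
# Illustrative sketch: penalize deviation from a scale-aligned single-image
# depth prior in regions where multi-view consistency is unreliable.
import torch

def prior_consistency_loss(pred_depth: torch.Tensor,
                           prior_depth: torch.Tensor,
                           dynamic_mask: torch.Tensor,
                           eps: float = 1e-6) -> torch.Tensor:
    # Align the prior's unknown scale to the prediction via the median ratio.
    scale = pred_depth.median() / (prior_depth.median() + eps)
    aligned = prior_depth * scale
    rel_diff = (pred_depth - aligned).abs() / (aligned + eps)
    # Average the relative error over the masked (dynamic) region only.
    return (rel_diff * dynamic_mask).sum() / (dynamic_mask.sum() + eps)
```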
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Self-Distillation for Further Pre-training of Transformers
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
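A minimal sketch of self-distillation as a regularizer, assuming a frozen snapshot of the model serves as the teacher (the function name and KL form are illustrative assumptions, not the paper's exact objective):
```python
# Illustrative sketch: regularize further pre-training by distilling the
# model's own earlier checkpoint (the "teacher") into the current model.
import copy
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, inputs, temperature=1.0):
    with torch.no_grad():
        teacher_logits = teacher(inputs)  # frozen earlier snapshot
    student_logits = student(inputs)
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )

# teacher = copy.deepcopy(student).eval()  # snapshot taken before this stage
```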
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performance is sub-optimal or even lags far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
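For context, the simplest form of visual prompt tuning prepends learnable tokens to a frozen transformer's input; the sketch below is that generic pattern, not the Pro-tuning design itself:
```python
# Illustrative sketch of generic visual prompt tuning: learnable prompt
# tokens are prepended to the input of a frozen vision transformer.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, vit: nn.Module, dim: int = 768, n_prompts: int = 16):
        super().__init__()
        self.vit = vit  # assumed to consume a (B, N, C) token sequence
        for p in self.vit.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        return self.vit(torch.cat([prompts, tokens], dim=1))
```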
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
- Forecasting of depth and ego-motion with transformers and self-supervision
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolutional and transformer modules.
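The photometric objective underlying this family of methods is standard; a compact sketch follows (the SSIM+L1 mix, weights, and window size reflect common practice and are assumptions rather than this paper's exact loss):
```python
# Illustrative sketch of the usual self-supervised photometric loss between a
# target frame and a source frame warped into the target view using the
# predicted depth and ego-motion.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return ((1 - num / den) / 2).clamp(0, 1)  # dissimilarity in [0, 1]

def photometric_loss(target, warped, alpha=0.85):
    l1 = (target - warped).abs()
    return (alpha * ssim(target, warped) + (1 - alpha) * l1).mean()
```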
arXiv Detail & Related papers (2022-06-15T10:14:11Z)
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)