Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
- URL: http://arxiv.org/abs/2311.01034v1
- Date: Thu, 2 Nov 2023 06:56:50 GMT
- Title: Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
- Authors: Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, Zhihai He
- Abstract summary: We propose a few-shot-based method that learns to adapt Vision-Language Models for monocular depth estimation.
Specifically, it assigns different depth bins to different scenes, and the model selects among them during inference.
With only one image per scene for training, extensive experiments on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
- Score: 31.34615135846137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown enhanced
performance across a range of tasks that involve the integration of visual and
linguistic modalities. When CLIP is used for depth estimation, the patches
divided from the input image can be compared against a series of semantic
descriptions of depth to obtain similarity scores. A coarse depth estimate is
then obtained by weighting and summing the depth values, called depth bins,
that correspond to the predefined semantic descriptions. This zero-shot
approach circumvents the computationally intensive and time-consuming nature
of traditional fully supervised depth estimation methods. However, because
this method uses fixed depth bins, it may not generalize well, since images
from different scenes can exhibit distinct depth distributions. To address
this challenge, we propose a few-shot-based method that learns to adapt VLMs
for monocular depth estimation, balancing training cost against generalization
capability. Specifically, it assigns different depth bins to different scenes,
and the model selects among them during inference. Additionally, we
incorporate learnable prompts that preprocess the input text, converting
human-readable descriptions into vectors the model can use directly, further
enhancing performance. With only one image per scene for training, our
extensive experiments on the NYU V2 and KITTI datasets demonstrate that our
method outperforms the previous state-of-the-art method by up to 10.6% in
terms of MARE.
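To make the pipeline described in the abstract concrete, below is a minimal, hypothetical sketch of the zero-shot CLIP depth-bin weighting that the paper builds on, written against the openai/CLIP package. The prompt wording, the bin values, and the crop-based "patching" are illustrative assumptions rather than the paper's configuration; the paper itself replaces the fixed bins with scene-specific bins selected at inference and the hand-written prompts with learnable prompt vectors.

```python
# Sketch of zero-shot CLIP depth estimation with depth bins.
# Prompts, bin values, and the crop-based "patching" are illustrative
# assumptions, not the paper's exact configuration.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Semantic descriptions of depth and their associated depth bins (metres).
depth_classes = ["extremely close", "close", "not in distance",
                 "a little remote", "far", "unseen"]
depth_bins = torch.tensor([1.0, 1.5, 2.0, 2.5, 2.75, 3.0], device=device)
prompts = [f"This object is {c}." for c in depth_classes]

@torch.no_grad()
def coarse_depth(image_path: str, grid: int = 7) -> torch.Tensor:
    """Return a grid x grid coarse depth map for one RGB image."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    # Crude "patching": encode non-overlapping crops with the image encoder.
    # (The actual method works on ViT patch tokens; this is a simplification.)
    crops = []
    for i in range(grid):
        for j in range(grid):
            box = (j * w // grid, i * h // grid,
                   (j + 1) * w // grid, (i + 1) * h // grid)
            crops.append(preprocess(image.crop(box)))
    patch_feat = model.encode_image(torch.stack(crops).to(device))
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    patch_feat = patch_feat / patch_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Patch-text similarity -> softmax weights over the depth bins,
    # then a weighted sum gives the coarse per-patch depth.
    sim = patch_feat @ text_feat.t()                  # (grid*grid, num_bins)
    weights = (100.0 * sim).softmax(dim=-1)
    depth = (weights * depth_bins).sum(dim=-1)        # (grid*grid,)
    return depth.reshape(grid, grid)
```

In the few-shot setting proposed by the paper, the fixed depth_bins tensor above would be replaced by scene-specific bin sets learned from a single training image per scene, with the model selecting among them at inference, and the hand-written prompts would be replaced by learnable prompt vectors.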
Related papers
- ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module (a minimal sketch of this decomposition appears after this list).
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z)
- CLIP Can Understand Depth [5.6138460823631835]
We adapt CLIP to produce monocular depth estimates of meaningful quality with dense prediction.
Our model exhibits impressive performance matching several previous state-of-the-art vision-only models.
arXiv Detail & Related papers (2024-02-05T18:09:33Z)
- Single Image Depth Prediction Made Better: A Multivariate Gaussian Take [163.14849753700682]
We introduce an approach that performs continuous modeling of per-pixel depth.
Our method (named MG) ranks among the most accurate on the KITTI depth-prediction benchmark leaderboard.
arXiv Detail & Related papers (2023-03-31T16:01:03Z)
- Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth [33.66537809438079]
We propose an amplifier method for enhancing the performance of weakly-supervised object detection (WSOD).
By analyzing the relationship between language context and depth, we calculate depth priors to identify bounding box proposals that may contain an object of interest.
Our proposed method is evaluated on six datasets by implementing it on top of two state-of-the-art WSOD methods.
arXiv Detail & Related papers (2023-03-20T08:26:29Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model to generate a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Monocular Depth Estimation Using Cues Inspired by Biological Vision Systems [22.539300644593936]
Monocular depth estimation (MDE) aims to transform an RGB image of a scene into a pixelwise depth map from the same camera view.
Part of the MDE task is to learn which visual cues in the image can be used for depth estimation, and how.
We demonstrate that explicitly injecting visual cue information into the model is beneficial for depth estimation.
arXiv Detail & Related papers (2022-04-21T19:42:36Z)
- X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation [69.9604394044652]
We propose a novel method to improve the self-supervised training of monocular depth via cross-task knowledge distillation.
During training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network.
We extensively evaluate the efficacy of our proposed approach on the KITTI benchmark and compare it with the latest state of the art.
arXiv Detail & Related papers (2021-10-24T19:47:14Z)
- ADAADepth: Adapting Data Augmentation and Attention for Self-Supervised Monocular Depth Estimation [8.827921242078881]
We propose ADAA, utilising depth augmentation as depth supervision for learning accurate and robust depth.
We propose a relational self-attention module that learns rich contextual features and further enhances depth results.
We evaluate our predicted depth on the KITTI driving dataset and achieve state-of-the-art results.
arXiv Detail & Related papers (2021-03-01T09:06:55Z)
- DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation.
Experiments show that our method outperforms previous methods on 8 datasets by a large margin under the zero-shot test setting.
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
- Single Image Depth Estimation Trained via Depth from Defocus Cues [105.67073923825842]
Estimating depth from a single RGB image is a fundamental task in computer vision.
In this work, we rely on depth-from-focus cues instead of different views.
We present results that are on par with supervised methods on KITTI and Make3D datasets and outperform unsupervised learning approaches.
arXiv Detail & Related papers (2020-01-14T20:22:54Z)
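As referenced in the ScaleDepth entry above, the sketch below illustrates, under assumed names, shapes, and values, the scale/relative-depth decomposition: a scene-level scale and a normalized relative depth map are predicted separately and multiplied to recover metric depth. This is an illustrative reading of the summary, not the paper's implementation.

```python
# Hypothetical illustration of decomposing metric depth into a scene-level
# scale and a normalized relative depth map (ScaleDepth-style). Names, shapes,
# and values are assumptions for illustration only.
import torch

def compose_metric_depth(scene_scale: torch.Tensor,
                         relative_depth: torch.Tensor) -> torch.Tensor:
    """metric_depth[b, h, w] = scene_scale[b] * relative_depth[b, h, w]."""
    return scene_scale.view(-1, 1, 1) * relative_depth

relative = torch.rand(2, 480, 640)   # relative depth in [0, 1], per pixel
scale = torch.tensor([10.0, 80.0])   # e.g. an indoor and an outdoor scene
metric = compose_metric_depth(scale, relative)
print(metric.shape)                  # torch.Size([2, 480, 640])
```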
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.