CLIP Can Understand Depth
- URL: http://arxiv.org/abs/2402.03251v1
- Date: Mon, 5 Feb 2024 18:09:33 GMT
- Title: CLIP Can Understand Depth
- Authors: Dunam Kim, Seokju Lee
- Abstract summary: We adapt CLIP to achieve meaningful quality in monocular depth estimation with dense prediction.
Our model exhibits impressive performance matching several previous state-of-the-art vision-only models.
- Score: 5.6138460823631835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on generalizing CLIP for monocular depth estimation reveal
that CLIP pre-trained on web-crawled data is inefficient for deriving proper
similarities between image patches and depth-related prompts. In this paper, we
adapt CLIP to achieve meaningful quality in monocular depth estimation with dense
prediction, without fine-tuning its original vision-language alignment. By
jointly training a compact deconvolutional decoder with a tiny learnable
embedding matrix named mirror, which serves as a static prompt for its text encoder, CLIP is
enabled to understand depth. With this approach, our model exhibits impressive
performance matching several previous state-of-the-art vision-only models on
the NYU Depth v2 and KITTI datasets, outperforming every CLIP-based depth
estimation model by a large margin. Experiments on temporal depth consistency
and spatial continuity demonstrate that the prior knowledge of CLIP can be
effectively refined by our proposed framework. Furthermore, an ablation study
on mirror shows that the resulting model estimates depth using knowledge
not only from the image encoder but also from the text encoder, despite not being given
any human-written prompt. This research demonstrates that through
minimal adjustments, the prior knowledge of vision-language foundation models,
such as CLIP, can be generalized even to domains where learning during
pretraining is challenging. We hope to facilitate future work on methods that
adjust the suboptimal prior knowledge of vision-language models using non-human
language prompts while achieving performance on par with task-specific
state-of-the-art methodologies.
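The abstract describes a simple recipe: keep both CLIP encoders frozen, feed a small learnable "mirror" embedding matrix to the text encoder in place of human-written prompts, and decode the patch-text similarity maps into dense depth with a compact deconvolutional decoder. Below is a minimal PyTorch-style sketch of that recipe; it is not the authors' implementation, and the encoder interfaces, the number of mirror tokens, the hidden sizes, and the decoder layout are all illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorPromptDepth(nn.Module):
    """Hypothetical sketch: frozen CLIP encoders + learnable "mirror" prompt + deconv decoder.

    Assumptions (not from the paper's code): `image_encoder` returns patch tokens of
    shape (B, N, D) on a square ViT grid, and `text_encoder` maps prompt embeddings
    of shape (K, L, D) to text features of shape (K, D).
    """

    def __init__(self, image_encoder, text_encoder,
                 num_concepts=32, prompt_len=8, dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Keep the original vision-language alignment intact: no CLIP fine-tuning.
        for p in list(image_encoder.parameters()) + list(text_encoder.parameters()):
            p.requires_grad = False

        # "mirror": a tiny learnable embedding matrix used as a static, non-human prompt.
        self.mirror = nn.Parameter(0.02 * torch.randn(num_concepts, prompt_len, dim))

        # Compact deconvolutional decoder from patch-text score maps to a dense depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(num_concepts, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, images):
        b = images.shape[0]
        patches = self.image_encoder(images)          # (B, N, D) patch tokens
        h = w = int(patches.shape[1] ** 0.5)          # assume a square patch grid
        concepts = self.text_encoder(self.mirror)     # (K, D) depth-related "concepts"

        patches = F.normalize(patches, dim=-1)
        concepts = F.normalize(concepts, dim=-1)
        scores = patches @ concepts.t()               # (B, N, K) patch-text similarities
        scores = scores.permute(0, 2, 1).reshape(b, -1, h, w)
        return self.decoder(scores)                   # (B, 1, 4h, 4w) dense depth
```
Because gradients reach only mirror and the decoder, CLIP's pre-trained alignment stays frozen, which matches the abstract's claim of adapting prior knowledge with minimal adjustments.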
Related papers
- CLIP with Quality Captions: A Strong Pretraining for Vision Tasks [16.208506912410147]
We show that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods.
We find that mobile architectures also benefit significantly from CLIP pretraining.
arXiv Detail & Related papers (2024-05-14T19:06:24Z)
- Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation [31.34615135846137]
We propose a few-shot-based method which learns to adapt vision-language models for monocular depth estimation.
Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference.
With only one image per scene for training, our extensive experimental results on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
arXiv Detail & Related papers (2023-11-02T06:56:50Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP (a minimal sketch follows this entry).
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
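Here is a minimal PyTorch-style sketch of the retrieval-and-fusion idea summarized above: a frozen CLIP embedding retrieves its nearest neighbors from a memory bank, and a single cross-attention layer fuses them into a refined embedding. The memory layout, the choice of k, and the module names are assumptions for illustration, not the RECO implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalFusion(nn.Module):
    """Illustrative sketch of retrieval-enhanced refinement over a frozen CLIP.

    `memory` is a hypothetical bank of cross-modal embeddings (M, D); the single
    cross-attention layer stands in for the paper's light-weight single-layer
    fusion transformer.
    """

    def __init__(self, embed_dim=512, num_heads=8, k=16):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, query_emb, memory):
        # query_emb: (B, D) frozen CLIP embeddings; memory: (M, D).
        q = F.normalize(query_emb, dim=-1)
        sims = q @ F.normalize(memory, dim=-1).t()       # (B, M) cosine similarities
        idx = sims.topk(self.k, dim=-1).indices          # (B, k) nearest entries
        retrieved = memory[idx]                          # (B, k, D)

        # Single-layer fusion: each query attends to its retrieved neighbors.
        fused, _ = self.attn(q.unsqueeze(1), retrieved, retrieved)
        return self.norm(query_emb + fused.squeeze(1))   # refined (B, D) embedding
```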
- CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning [14.496173899477283]
We study the problem of Compositional Zero-Shot Learning (CZSL), which is to recognize novel attribute-object combinations with pre-existing concepts.
We propose to insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer.
We further equip the adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-05-26T07:02:57Z)
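The adapter idea summarized above can be illustrated with a short, hypothetical PyTorch sketch: a bottleneck adapter with a residual connection wrapped around each frozen CLIP transformer block, with one adapter per concept ("object", "attribute", "composition"). Dimensions and the wrapping strategy are assumptions, not CAILA's released code.
```python
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (down-project, nonlinearity, up-project, residual),
    a hypothetical stand-in for CAILA's concept-aware intra-layer adapters."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact


class AdaptedBlock(nn.Module):
    """Wraps one frozen CLIP transformer block with per-concept adapters;
    only the adapters receive gradients."""

    def __init__(self, clip_block, dim=768):
        super().__init__()
        self.block = clip_block
        for p in self.block.parameters():
            p.requires_grad = False
        self.adapters = nn.ModuleDict({
            name: Adapter(dim) for name in ("object", "attribute", "composition")
        })

    def forward(self, x, concept="composition"):
        return self.adapters[concept](self.block(x))
```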
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones (a minimal sketch of the pixel-text matching idea follows this entry).
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
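The pixel-text matching idea from the DenseCLIP entry reduces to computing the similarity between every pixel's feature and each class's text embedding. The sketch below shows that computation under assumed shapes and an assumed temperature; it is illustrative, not the paper's implementation.
```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(dense_feats, text_feats, temperature=0.07):
    """Illustrative sketch of DenseCLIP-style pixel-text matching.

    dense_feats: (B, D, H, W) per-pixel features from a CLIP visual backbone
    text_feats:  (K, D) embeddings of K class prompts from the CLIP text encoder
    Returns (B, K, H, W) score maps that can guide a dense prediction head.
    Shapes and the temperature value are assumptions for illustration.
    """
    B, D, H, W = dense_feats.shape
    v = F.normalize(dense_feats.flatten(2), dim=1)    # (B, D, H*W)
    t = F.normalize(text_feats, dim=-1)               # (K, D)
    scores = torch.einsum("kd,bdn->bkn", t, v) / temperature
    return scores.reshape(B, -1, H, W)
```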