ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
- URL: http://arxiv.org/abs/2403.18807v4
- Date: Wed, 17 Apr 2024 14:59:51 GMT
- Title: ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
- Authors: Suraj Patni, Aradhye Agarwal, Chetan Arora,
- Abstract summary: In absence of parallax cues, a learning-based single image depth estimation model relies heavily on shading and contextual cues in the image.
It is necessary to train such models on large and varied datasets, which are difficult to capture.
Using embeddings from pre-trained foundational models, such as CLIP, improves zero shot transfer in several applications.
- Score: 5.179738379203527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures greater relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). And on KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io
Related papers
- Tiny models from tiny data: Textual and null-text inversion for few-shot distillation [11.80626524879555]
Few-shot image classification involves classifying images using very few training examples.
Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference.
We present a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion.
arXiv Detail & Related papers (2024-06-05T11:01:42Z) - Dataset Distillation via Curriculum Data Synthesis in Large Data Era [26.883100340763317]
We introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation during data synthesis.
The proposed model outperforms the current state-of-the-art methods like SRe$2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterparts to less than absolute 15%.
arXiv Detail & Related papers (2023-11-30T18:59:56Z) - Learning from History: Task-agnostic Model Contrastive Learning for
Image Restoration [79.04007257606862]
This paper introduces an innovative method termed 'learning from history', which dynamically generates negative samples from the target model itself.
Our approach, named Model Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models as negative models, making it compatible with diverse image restoration tasks.
arXiv Detail & Related papers (2023-09-12T07:50:54Z) - DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion
Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z) - RemoteCLIP: A Vision Language Foundation Model for Remote Sensing [13.814076157988225]
We propose RemoteCLIP, a vision-language foundation model for remote sensing.
It aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application.
RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $textitk$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images.
arXiv Detail & Related papers (2023-06-19T15:46:41Z) - TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs)
However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z) - NoiSER: Noise is All You Need for Enhancing Low-Light Images Without
Task-Related Data [103.04999391668753]
We show that it is possible to enhance a low-light image without any task-related training data.
Technically, we propose a new, magical, effective and efficient method, termed underlineNoise underlineSElf-underlineRegression (NoiSER)
Our NoiSER is highly competitive to current task-related data based LLIE models in terms of quantitative and visual results.
arXiv Detail & Related papers (2022-11-09T06:18:18Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z) - Core Risk Minimization using Salient ImageNet [53.616101711801484]
We introduce the Salient Imagenet dataset with more than 1 million soft masks localizing core and spurious features for all 1000 Imagenet classes.
Using this dataset, we first evaluate the reliance of several Imagenet pretrained models (42 total) on spurious features.
Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features.
arXiv Detail & Related papers (2022-03-28T01:53:34Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID)
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.