Deep Digging into the Generalization of Self-Supervised Monocular Depth
Estimation
- URL: http://arxiv.org/abs/2205.11083v3
- Date: Mon, 20 Mar 2023 03:52:42 GMT
- Title: Deep Digging into the Generalization of Self-Supervised Monocular Depth
Estimation
- Authors: Jinwoo Bae, Sungho Moon, Sunghoon Im
- Abstract summary: Self-supervised monocular depth estimation has been widely studied recently.
We investigate the backbone networks (e.g. CNNs, Transformers, and CNN-Transformer hybrid models) toward the generalization of monocular depth estimation.
- Score: 12.336888210144936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation has been widely studied recently.
Most of the work has focused on improving performance on benchmark datasets,
such as KITTI, but has offered only a few experiments on generalization performance.
In this paper, we investigate the backbone networks (e.g. CNNs, Transformers,
and CNN-Transformer hybrid models) toward the generalization of monocular depth
estimation. We first evaluate state-of-the-art models on diverse public
datasets, which have never been seen during the network training. Next, we
investigate the effects of texture-biased and shape-biased representations
using the various texture-shifted datasets that we generated. We observe that
Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We
also find that shape-biased models show better generalization performance for
monocular depth estimation compared to texture-biased models. Based on these
observations, we design a new CNN-Transformer hybrid network with a
multi-level adaptive feature fusion module, called MonoFormer. The design
intuition behind MonoFormer is to increase shape bias by employing Transformers
while compensating for the weak locality bias of Transformers by adaptively
fusing multi-level representations. Extensive experiments show that the
proposed method achieves state-of-the-art performance with various public
datasets. Our method also shows the best generalization ability among competing methods.
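The abstract describes MonoFormer's multi-level adaptive feature fusion only at a high level, and the authors' code is not reproduced here. The following is a minimal, hypothetical PyTorch sketch of the general idea: a learned gate that weights a CNN feature map (local texture cues) against a Transformer feature map (global shape cues) at one decoder level. The class name, channel sizes, and gating scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (NOT the authors' code): adaptively fuse a CNN feature map with a
# Transformer feature map using a learned per-pixel gate, as the abstract describes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureFusion(nn.Module):
    """Fuses a CNN feature map (local/texture cues) with a Transformer feature
    map (global/shape cues) at one decoder level using a learned gate."""
    def __init__(self, cnn_channels: int, trans_channels: int, out_channels: int):
        super().__init__()
        self.proj_cnn = nn.Conv2d(cnn_channels, out_channels, kernel_size=1)
        self.proj_trans = nn.Conv2d(trans_channels, out_channels, kernel_size=1)
        # The gate predicts a per-pixel weight for each branch from their concatenation.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, 2, kernel_size=1),
        )

    def forward(self, cnn_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
        # Bring the Transformer features (already reshaped to B x C x h x w)
        # to the CNN feature resolution before fusing.
        trans_feat = F.interpolate(trans_feat, size=cnn_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        c = self.proj_cnn(cnn_feat)
        t = self.proj_trans(trans_feat)
        weights = torch.softmax(self.gate(torch.cat([c, t], dim=1)), dim=1)
        return weights[:, 0:1] * c + weights[:, 1:2] * t

# Usage: fuse a stride-8 CNN map with a coarser ViT feature map of the same image.
if __name__ == "__main__":
    cnn_feat = torch.randn(1, 256, 24, 80)    # e.g. a ResNet stage output (assumed shape)
    trans_feat = torch.randn(1, 768, 12, 40)  # e.g. ViT tokens reshaped to a grid (assumed shape)
    fusion = AdaptiveFeatureFusion(256, 768, 256)
    print(fusion(cnn_feat, trans_feat).shape)  # torch.Size([1, 256, 24, 80])
```

In a full decoder, one such module would sit at each resolution level, so the network can lean on Transformer (shape) features where they aid generalization and fall back on CNN (local) features for fine detail, which matches the design intuition stated in the abstract.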
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to contrastively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
arXiv Detail & Related papers (2023-08-16T12:46:52Z)
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 of the FLOPs) compared to pure Transformer-based methods (a rough sketch of this kind of coupled block appears after this list).
arXiv Detail & Related papers (2023-04-25T17:59:47Z)
- A Study on the Generality of Neural Network Structures for Monocular Depth Estimation [14.09373215954704]
We deeply investigate the various backbone networks toward the generalization of monocular depth estimation.
We evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets.
We observe that Transformers exhibit a strong shape bias, whereas CNNs have a strong texture bias.
arXiv Detail & Related papers (2023-01-09T04:58:12Z)
- MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer [52.0699787446221]
We propose MonoViT, a framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy.
arXiv Detail & Related papers (2022-08-06T16:54:45Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
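As referenced in the CompletionFormer entry above, several of the listed hybrids (the JCAT block, MonoViT, and MonoFormer itself) share a common pattern: run a local convolutional branch and a global self-attention branch over the same features and merge the results. The sketch below illustrates only that shared pattern under assumed sizes; it is not any of the published implementations, and a plain depthwise-convolution branch stands in for CompletionFormer's convolutional attention layer.

```python
# Hypothetical sketch of a coupled convolution + self-attention block in the spirit
# of the CNN-Transformer hybrids listed above. All names and sizes are illustrative.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Runs a local convolutional branch and a global self-attention branch on
    the same feature map and sums them, keeping locality while adding global context."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv_branch(x)                     # local (texture) branch
        tokens = x.flatten(2).transpose(1, 2)           # B x (H*W) x C
        tokens = self.norm(tokens)
        global_, _ = self.attn(tokens, tokens, tokens)  # global self-attention branch
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_                      # residual fusion of both branches

# Usage on a small feature map.
if __name__ == "__main__":
    block = ConvAttentionBlock(channels=64)
    print(block(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```

Stacking such blocks in a pyramidal encoder, as CompletionFormer does with JCAT, keeps the convolutional locality bias at every scale while the attention branch supplies the global context that the surveyed papers associate with stronger shape bias and better generalization.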
This list is automatically generated from the titles and abstracts of the papers in this site.