A Study on the Generality of Neural Network Structures for Monocular Depth Estimation
- URL: http://arxiv.org/abs/2301.03169v3
- Date: Sun, 10 Dec 2023 23:38:26 GMT
- Title: A Study on the Generality of Neural Network Structures for Monocular Depth Estimation
- Authors: Jinwoo Bae and Kyumin Hwang and Sunghoon Im
- Abstract summary: We deeply investigate the various backbone networks toward the generalization of monocular depth estimation.
We evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets.
We observe that Transformers exhibit a stronger shape-bias than CNNs, which have a strong texture-bias.
- Score: 14.09373215954704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular depth estimation has been widely studied, and significant
improvements in performance have been recently reported. However, most previous
works are evaluated on a few benchmark datasets, such as the KITTI dataset, and
none of the works provide an in-depth analysis of the generalization
performance of monocular depth estimation. In this paper, we deeply investigate
various backbone networks (e.g., CNN and Transformer models) toward the
generalization of monocular depth estimation. First, we evaluate
state-of-the-art models on both in-distribution and out-of-distribution
datasets, which have never been seen during network training. Then, we
investigate the internal properties of the representations from the
intermediate layers of CNN-/Transformer-based models using synthetic
texture-shifted datasets. Through extensive experiments, we observe that
Transformers exhibit a stronger shape-bias than CNNs, which have a strong
texture-bias. We also discover that texture-biased models exhibit worse
generalization performance for monocular depth estimation than shape-biased
models. We demonstrate that similar aspects are observed in real-world driving
datasets captured under diverse environments. Lastly, we conduct a dense
ablation study with the various backbone networks utilized in modern
monocular depth estimation methods. The experiments demonstrate that the intrinsic locality of CNNs
and the self-attention of the Transformers induce texture-bias and shape-bias,
respectively.
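As a rough illustration of the evaluation protocol above, the following PyTorch sketch probes a depth model's texture-bias by comparing its error on clean images against the same images with texture suppressed. The `bias_probe` helper, the heavy Gaussian blur (standing in for the paper's synthetic style-transferred datasets), and the AbsRel metric choice are illustrative assumptions, not the authors' code.

```python
import torch
import torchvision.transforms as T

def abs_rel(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Standard AbsRel depth metric over valid (positive) ground-truth pixels."""
    mask = gt > 0
    return ((pred[mask] - gt[mask]).abs() / gt[mask]).mean().item()

# Crude texture shift: heavy blur removes fine texture while keeping coarse shape.
texture_shift = T.GaussianBlur(kernel_size=21, sigma=5.0)

@torch.no_grad()
def bias_probe(model, images, gt_depths):
    """Compare errors on clean vs. texture-shifted inputs; a texture-biased
    model degrades far more under the shift than a shape-biased one."""
    model.eval()
    clean = abs_rel(model(images), gt_depths)
    shifted = abs_rel(model(texture_shift(images)), gt_depths)
    return clean, shifted
```

Running this probe on a CNN-based and a Transformer-based backbone side by side mirrors the paper's comparison of texture- and shape-bias.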
Related papers
- Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry [1.2289361708127877]
We propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry.
The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference.
arXiv Detail & Related papers (2024-09-13T12:21:25Z)
- Impacts of Color and Texture Distortions on Earth Observation Data in Deep Learning [5.128534415575421]
Land cover classification and change detection are important applications of remote sensing and Earth observation.
However, the influence of different visual characteristics of the input EO data on a model's predictions is not well understood.
We conduct experiments with multiple state-of-the-art segmentation networks for land cover classification and show that they are in general more sensitive to texture than to color distortions.
arXiv Detail & Related papers (2024-03-07T10:25:23Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
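For intuition, a visual prompt is typically a small learnable perturbation mounted on the input image and trained jointly with the task; the sketch below, with an assumed border-style prompt and padding width, illustrates the general mechanism rather than the exact VPNs design.

```python
import torch
import torch.nn as nn

class PaddedVisualPrompt(nn.Module):
    """Learnable border prompt added to input images (illustrative sketch)."""
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        # Only a border of width `pad` is learnable; the image center stays untouched.
        mask = torch.ones(1, 1, image_size, image_size)
        mask[:, :, pad:-pad, pad:-pad] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.prompt * self.mask
```

The prompt parameters are optimized with the usual task loss while the network's weights are being sparsified.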
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to comparatively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
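The "high-order differentiation" this summary mentions can be pictured as finite differences of the predicted depth map; the sketch below shows first- and second-order horizontal differences, which could be penalized to encourage gradient continuity. It is a hedged illustration, not the paper's DGR module (which also performs feature fusion and recalibration).

```python
import torch

def depth_gradients(depth: torch.Tensor):
    """depth: [B, 1, H, W]. Returns first- and second-order horizontal
    finite differences; the vertical direction is analogous."""
    dx = depth[..., :, 1:] - depth[..., :, :-1]   # first-order gradient
    dxx = dx[..., :, 1:] - dx[..., :, :-1]        # second-order gradient
    return dx, dxx

def gradient_continuity_loss(depth: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes in the depth gradient (illustrative)."""
    _, dxx = depth_gradients(depth)
    return dxx.abs().mean()
```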
arXiv Detail & Related papers (2023-08-16T12:46:52Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
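A concrete instance of such a distance is the widely used Fréchet distance between Gaussian fits of real and generated features, FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2(C1 C2)^{1/2}). The sketch below assumes features have already been extracted, since the choice of extractor and the number of samples are exactly the factors the study analyzes.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets [N, D]."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```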
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator implies that the latent space provides an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Deep Digging into the Generalization of Self-Supervised Monocular Depth Estimation [12.336888210144936]
Self-supervised monocular depth estimation has been widely studied recently.
We investigate the backbone networks (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) toward the generalization of monocular depth estimation.
arXiv Detail & Related papers (2022-05-23T06:56:25Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
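The mechanism behind this global context is self-attention over flattened feature tokens, where every spatial location can attend to every other; the block below is a generic illustration of that idea, not DepthFormer's actual architecture.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Self-attention over flattened feature-map tokens (generic sketch)."""
    def __init__(self, channels: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)    # [B, H*W, C]
        out, _ = self.attn(tokens, tokens, tokens)  # every location attends globally
        tokens = self.norm(tokens + out)            # residual connection + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```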
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- How Well Do Sparse ImageNet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
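A minimal version of this setup, sketched with assumed choices (ResNet-50, 90% global magnitude pruning via torch.nn.utils.prune), looks as follows; the study itself covers stronger pruning methods and many sparsity levels.

```python
import torch
import torchvision.models as models
import torch.nn.utils.prune as prune

# Start from an ImageNet-pretrained "upstream" model.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Collect all convolutional weight tensors and prune the 90% smallest
# magnitudes globally across the network.
conv_params = [(m, "weight") for m in model.modules()
               if isinstance(m, torch.nn.Conv2d)]
prune.global_unstructured(conv_params,
                          pruning_method=prune.L1Unstructured, amount=0.9)

# The pruning masks keep removed weights at zero during downstream fine-tuning.
```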
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
- Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than One [32.01675089157679]
We propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
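The ensemble idea can be sketched as simply running each weak predictor and merging the outputs; plain averaging is assumed below for illustration, and the paper's actual merging strategy may differ.

```python
import torch

@torch.no_grad()
def ensemble_depth(models, image: torch.Tensor) -> torch.Tensor:
    """Average the depth maps of several (CNN- or Transformer-based) predictors."""
    preds = [m(image) for m in models]          # each: [B, 1, H, W]
    return torch.stack(preds, dim=0).mean(dim=0)
```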
arXiv Detail & Related papers (2021-11-16T09:09:05Z)
- On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
For the first time, we study the interplay between out-of-distribution and transfer performance of modern image classification CNNs.
We find that increasing both the training set and model sizes significantly improves distributional-shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)