DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
- URL: http://arxiv.org/abs/2409.15010v2
- Date: Fri, 25 Oct 2024 12:15:32 GMT
- Title: DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
- Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin
- Abstract summary: We introduce the first autoregressive depth estimation model based on the visual autoregressive transformer.
Our primary contribution is DepthART, a novel training method formulated as Depth Autoregressive Refinement Task.
Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction in the depth estimation task.
- Score: 2.3884184860468136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of discriminative approaches in monocular depth estimation, their quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART, a novel training method formulated as Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method utilizes a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction in the depth estimation task. The Visual Autoregressive Transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.
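The difference between static and dynamic targets described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 1D "depth map", average-pool downsampling, nearest-neighbour upsampling, no token quantization, and a hypothetical stand-in "model" (`model_error`) that simply returns the target plus a fixed per-scale error. The function names `downsample`, `upsample`, and `run_scales` are illustrative only.

```python
def downsample(x, size):
    """Average-pool a 1D list to `size` entries (len(x) must be divisible by size)."""
    block = len(x) // size
    return [sum(x[i * block:(i + 1) * block]) / block for i in range(size)]


def upsample(x, size):
    """Nearest-neighbour upsample a 1D list to `size` entries."""
    repeat = size // len(x)
    return [v for v in x for _ in range(repeat)]


def run_scales(depth, scales, model_error, dynamic):
    """Build per-scale residual targets and the toy model's accumulated output.

    dynamic=False: each target is the residual against a reconstruction built
                   from ground-truth residuals (static, teacher-forced targets).
    dynamic=True:  each target is the residual against the model's own running
                   reconstruction, so later scales can correct earlier errors.
    """
    n = len(depth)
    base = [0.0] * n        # reconstruction the next target is measured against
    pred_recon = [0.0] * n  # sum of the toy model's upsampled outputs
    targets = []
    for k, size in enumerate(scales):
        target = downsample([d - b for d, b in zip(depth, base)], size)
        # Hypothetical stand-in for a network: the target plus a fixed error.
        pred = [t + model_error[k] for t in target]
        targets.append(target)
        pred_recon = [r + u for r, u in zip(pred_recon, upsample(pred, n))]
        if dynamic:
            base = list(pred_recon)
        else:
            base = [b + u for b, u in zip(base, upsample(target, n))]
    return targets, pred_recon


depth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scales = [1, 2, 4, 8]
errors = [0.5, 0.0, 0.0, 0.0]  # the toy model is wrong only at the coarsest scale

static_targets, static_out = run_scales(depth, scales, errors, dynamic=False)
dynamic_targets, dynamic_out = run_scales(depth, scales, errors, dynamic=True)
print(static_out)   # coarse-scale error persists in the final map
print(dynamic_out)  # later targets absorb the error
```

In this toy setting the static run ends 0.5 above the true depth everywhere, while the dynamic run recovers the input exactly, because the second-scale target shifts from `[-2.0, 2.0]` to `[-2.5, 1.5]` to compensate for the first-scale mistake. A real model would of course not hit later targets perfectly; the sketch only shows why targets that depend on the model's own predictions allow self-refinement.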
Related papers
- An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and image-assisted training strategy.
Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
arXiv Detail & Related papers (2024-12-18T12:10:33Z)
- DepthFM: Fast Monocular Depth Estimation with Flow Matching [22.206355073676082]
Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport.
Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions.
Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and only requiring minimal synthetic data for training.
arXiv Detail & Related papers (2024-03-20T17:51:53Z)
- Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation [16.22199565010318]
We introduce a method that incorporates gradient-guided perturbations to the visual encoder of the multimodality model during both pre-training and fine-tuning phases.
The results show that, even with a significantly smaller pre-training image caption dataset, our approach achieves competitive outcomes.
arXiv Detail & Related papers (2024-03-05T06:57:37Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Neural Maximum A Posteriori Estimation on Unpaired Data for Motion Deblurring [87.97330195531029]
We propose a Neural Maximum A Posteriori (NeurMAP) estimation framework for training neural networks to recover blind motion information and sharp content from unpaired data.
The proposed NeurMAP is an approach to existing deblurring neural networks, and is the first framework that enables training image deblurring networks on unpaired datasets.
arXiv Detail & Related papers (2022-04-26T08:09:47Z)
- Learn to Adapt for Monocular Depth Estimation [17.887575611570394]
We propose an adversarial depth estimation task and train the model in the pipeline of meta-learning.
Our method adapts well to new datasets after few training steps during the test procedure.
arXiv Detail & Related papers (2022-03-26T06:49:22Z)
- Improving Deep Learning Interpretability by Saliency Guided Training [36.782919916001624]
Saliency methods have been widely used to highlight important input features in model predictions.
Most existing methods use backpropagation on a modified gradient function to generate saliency maps.
We introduce a saliency guided training procedure for neural networks to reduce noisy gradients used in predictions.
arXiv Detail & Related papers (2021-11-29T06:05:23Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.