Visual Autoregressive Modelling for Monocular Depth Estimation
- URL: http://arxiv.org/abs/2512.22653v1
- Date: Sat, 27 Dec 2025 17:08:03 GMT
- Title: Visual Autoregressive Modelling for Monocular Depth Estimation
- Authors: Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis
- Abstract summary: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
- Score: 69.01449528371916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code available at https://github.com/AmirMaEl/VAR-Depth.
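The abstract's inference recipe, ten fixed coarse-to-fine autoregressive stages with classifier-free guidance on the image condition, can be sketched as follows. This is a minimal illustration rather than the released VAR-Depth code: `cfg_combine`, `var_depth_inference`, the `model` signature, and the greedy token selection are all hypothetical.

```python
import numpy as np

def cfg_combine(cond_logits, uncond_logits, guidance_scale=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the image-conditioned one."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

def var_depth_inference(model, image, num_scales=10, guidance_scale=2.0):
    """Run a fixed number of coarse-to-fine autoregressive stages,
    applying CFG at every scale. `model` is a stand-in callable that
    maps (previous token maps, condition, scale) to logits."""
    tokens = []  # token maps accumulated from coarse to fine
    for scale in range(num_scales):
        cond = model(tokens, cond=image, scale=scale)
        uncond = model(tokens, cond=None, scale=scale)
        logits = cfg_combine(cond, uncond, guidance_scale)
        tokens.append(logits.argmax(axis=-1))  # greedy token selection
    return tokens
```

With guidance scale 1 the unconditional term cancels and the output reduces to the conditional prediction; larger scales push the prediction further toward the image-conditioned mode.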
Related papers
- ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization [13.916180996567128]
Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. We present a novel optimization framework for VAR models that fundamentally differs from prior approaches. Our approach aggressively accelerates the generation process while preserving semantic fidelity and fine details.
arXiv Detail & Related papers (2026-02-26T12:36:56Z) - Keyframe-Based Feed-Forward Visual Odometry [13.646685343885556]
Current foundation-model-based methods typically process raw image sequences indiscriminately. We propose a novel feed-forward VO method that employs reinforcement learning to derive an adaptive visual policy in a data-driven manner. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
arXiv Detail & Related papers (2026-01-22T14:45:42Z) - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z) - Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
arXiv Detail & Related papers (2025-10-16T03:37:16Z) - Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting [95.61137026932062]
Intern-GS is a novel approach that enhances sparse-view Gaussian splatting. We show that Intern-GS achieves state-of-the-art rendering quality across diverse datasets.
arXiv Detail & Related papers (2025-05-27T05:17:49Z) - MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model [2.0624236247076397]
This study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. The proposed model outperforms recent state-of-the-art methods, as demonstrated through evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments.
arXiv Detail & Related papers (2025-02-01T04:37:13Z) - Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
The Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, a more nuanced approach than traditional data-quantity metrics.
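Approximate Entropy itself is a standard regularity statistic (Pincus, 1991), so a minimal NumPy implementation illustrates how it could score a sequence as a data-quality proxy. The Chebyshev distance and the default tolerance `r = 0.2 * std` are common conventions, not necessarily the paper's exact choices.

```python
import numpy as np

def approximate_entropy(series, m=2, r=None):
    """Approximate Entropy: regularity of a time series.
    Lower values indicate more regular, more predictable data."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * x.std()  # common default tolerance

    def phi(m):
        # All overlapping length-m templates
        templates = np.array([x[i:i + m] for i in range(n - m + 1)])
        # Chebyshev distance between every pair of templates
        dist = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
        # Fraction of templates within tolerance r (self-matches included)
        c = np.mean(dist <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```

A strictly periodic sequence scores near zero (perfectly predictable continuations), while white noise scores well above it, which is the contrast a data-quality filter would exploit.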
arXiv Detail & Related papers (2024-11-30T10:56:30Z) - ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions [91.55655961014027]
3D semantic occupancy and flow prediction are fundamental to scene understanding. This paper proposes a vision-based framework with three targeted improvements. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy and joint semantic-flow prediction.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - DepthART: Monocular Depth Estimation as Autoregressive Refinement Task [2.3884184860468136]
We introduce DepthART, a novel training method formulated as a Depth Autoregressive Refinement Task. By utilizing the model's own predictions as inputs, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
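The DepthART idea of conditioning on the model's own predictions rather than teacher-forced ground truth can be sketched as below. This is an illustrative reconstruction from the abstract only: the function names, the toy NumPy cross-entropy, and the assumption that `gt_tokens[scale]` holds precomputed per-scale residual targets are all hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, target_ids):
    """Mean cross-entropy between logits [N, V] and integer targets [N]."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def depthart_step(model, image, gt_tokens, num_scales):
    """One DepthART-style training step (sketch): the model sees its OWN
    earlier-scale predictions as context, so the training-time input
    distribution matches the inference-time one."""
    context, losses = [], []
    for scale in range(num_scales):
        # Condition on self-predicted context, as at inference time
        logits = model(context, image, scale)  # [N, V]
        # gt_tokens[scale] plays the role of the residual target still
        # missing from the current reconstruction at this scale
        losses.append(softmax_cross_entropy(logits, gt_tokens[scale]))
        context.append(logits.argmax(axis=-1))  # self-prediction
    return float(np.mean(losses))
```

The key contrast with standard teacher forcing is the `context.append(...)` line: the next scale is conditioned on what the model actually produced, which is exactly what it will see at inference.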
arXiv Detail & Related papers (2024-09-23T13:36:34Z) - Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation [16.22199565010318]
We introduce a method that incorporates gradient-guided perturbations to the visual encoder of the multimodality model during both pre-training and fine-tuning phases.
The results show that, even with a significantly smaller pre-training image caption dataset, our approach achieves competitive outcomes.
arXiv Detail & Related papers (2024-03-05T06:57:37Z) - Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.