Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
- URL: http://arxiv.org/abs/2508.04928v1
- Date: Wed, 06 Aug 2025 23:23:20 GMT
- Title: Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
- Authors: Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong,
- Abstract summary: We propose a method to extend foundational monocular depth estimators (FMDEs) to fisheye images.<n>Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images.
- Score: 8.197905977697552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
Related papers
- FisheyeDepth: A Real Scale Self-Supervised Depth Estimation Model for Fisheye Camera [8.502741852406904]
We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras.<n>We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions.<n>We also incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network.
arXiv Detail & Related papers (2024-09-23T14:31:42Z) - RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation [88.54817424560056]
We propose a distortion vector map (DVM) that measures the degree and direction of local distortion.
By learning the DVM, the model can independently identify local distortions at each pixel without relying on global distortion patterns.
In the pre-training stage, it predicts the distortion vector map and perceives the local distortion features of each pixel.
In the fine-tuning stage, it predicts a pixel-wise flow map for deviated fisheye image rectification.
arXiv Detail & Related papers (2024-06-27T06:38:56Z) - RecDiffusion: Rectangling for Image Stitching with Diffusion Models [53.824503710254206]
We introduce a novel diffusion-based learning framework, textbfRecDiffusion, for image stitching rectangling.
This framework combines Motion Diffusion Models (MDM) to generate motion fields, effectively transitioning from the stitched image's irregular borders to a geometrically corrected intermediary.
arXiv Detail & Related papers (2024-03-28T06:22:45Z) - Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption [9.018416031676136]
A Manhattan world lying along cuboid buildings is useful for camera angle estimation.
We propose a learning-based calibration method that uses heatmap regression to detect the directions of labeled image coordinates.
Our method outperforms conventional methods on large-scale datasets and with off-the-shelf cameras.
arXiv Detail & Related papers (2023-03-30T05:57:59Z) - Sector Patch Embedding: An Embedding Module Conforming to The Distortion
Pattern of Fisheye Image [23.73394258521532]
We propose a novel patch embedding method called Sector Patch Embedding(SPE), conforming to the distortion pattern of the fisheye image.
The classification top-1 accuracy of ViT and PVT is improved by 0.75% and 2.8% with SPE respectively.
Our method can be easily adopted to other Transformer-based models.
arXiv Detail & Related papers (2023-03-26T07:20:02Z) - When the Sun Goes Down: Repairing Photometric Losses for All-Day Depth
Estimation [47.617222712429026]
We show how to use a combination of three techniques to allow the existing photometric losses to work for both day and nighttime images.
First, we introduce a per-pixel neural intensity transformation to compensate for the light changes that occur between successive frames.
Second, we predict a per-pixel residual flow map that we use to correct the reprojection correspondences induced by the estimated ego-motion and depth.
arXiv Detail & Related papers (2022-06-28T09:29:55Z) - FisheyeEX: Polar Outpainting for Extending the FoV of Fisheye Lens [84.12722334460022]
Fisheye lens gains increasing applications in computational photography and assisted driving because of its wide field of view (FoV)
In this paper, we present a FisheyeEX method that extends the FoV of the fisheye lens by outpainting the invalid regions.
The results demonstrate that our approach significantly outperforms the state-of-the-art methods, gaining around 27% more content beyond the original fisheye image.
arXiv Detail & Related papers (2022-06-12T21:38:50Z) - Relighting Images in the Wild with a Self-Supervised Siamese
Auto-Encoder [62.580345486483886]
We propose a self-supervised method for image relighting of single view images in the wild.
The method is based on an auto-encoder which deconstructs an image into two separate encodings.
We train our model on large-scale datasets such as Youtube 8M and CelebA.
arXiv Detail & Related papers (2020-12-11T16:08:50Z) - SIR: Self-supervised Image Rectification via Seeing the Same Scene from
Multiple Different Lenses [82.56853587380168]
We propose a novel self-supervised image rectification (SIR) method based on an important insight that the rectified results of distorted images of the same scene from different lens should be the same.
We leverage a differentiable warping module to generate the rectified images and re-distorted images from the distortion parameters.
Our method achieves comparable or even better performance than the supervised baseline method and representative state-of-the-art methods.
arXiv Detail & Related papers (2020-11-30T08:23:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.