Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation
- URL: http://arxiv.org/abs/2406.13378v1
- Date: Wed, 19 Jun 2024 09:19:06 GMT
- Title: Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation
- Authors: Zidong Cao, Jinjing Zhu, Weiming Zhang, Lin Wang
- Abstract summary: We propose a semi-supervised learning framework to learn a 360 depth foundation model, dubbed Any360D.
Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision.
Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, showing impressive zero-shot capacity for being a 360 depth foundation model.
- Score: 19.202253857381688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question how DAM performs on 360 images, which enjoy a large field-of-view (180°×360°) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.
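The core technical ingredient described in the abstract is consistency regularization between an unlabeled panorama and its Möbius-transformed counterpart, with pseudo labels supplied by the fine-tuned teacher. Below is a minimal, hypothetical PyTorch sketch of how such a pipeline could look; the function names (`mobius_augment`, `consistency_step`), the stereographic-projection convention, the L1 consistency loss, and the free Möbius parameters `a, b, c, d` are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of Any360D-style consistency regularization with
# Möbius transformation-based spatial augmentation (MTSA). All names and
# design choices here are assumptions for illustration only.
import torch
import torch.nn.functional as F


def mobius_augment(erp, a=1 + 0j, b=0j, c=0j, d=1 + 0j):
    """Warp an equirectangular (ERP) tensor (B, C, H, W) with a Möbius
    transformation M(z) = (a*z + b) / (c*z + d), applied on the complex
    plane obtained by stereographic projection of the viewing sphere."""
    B, _, H, W = erp.shape
    dev = erp.device
    # Target ERP pixel grid -> spherical coordinates (longitude, latitude).
    lon = torch.linspace(-torch.pi, torch.pi, W, device=dev)
    lat = torch.linspace(torch.pi / 2, -torch.pi / 2, H, device=dev)
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")
    # Sphere -> complex plane via stereographic projection.
    z = torch.tan(torch.pi / 4 - lat / 2) * torch.exp(1j * lon)
    # Inverse Möbius map: find the source point sampled by each target pixel.
    w = (d * z - b) / (-c * z + a)
    # Complex plane -> sphere -> normalized ERP sampling grid in [-1, 1].
    src_lat = torch.pi / 2 - 2 * torch.atan(torch.abs(w))
    src_lon = torch.angle(w)
    grid = torch.stack([src_lon / torch.pi, -src_lat / (torch.pi / 2)], dim=-1)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(erp, grid, padding_mode="border", align_corners=True)


def consistency_step(student, teacher, unlabeled_erp, a, b, c, d):
    """One SSL step: the frozen teacher pseudo-labels the raw panorama,
    and the student must agree on the Möbius-augmented view (the pseudo
    label is warped with the same transformation)."""
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_erp)  # (B, 1, H, W)
    aug_input = mobius_augment(unlabeled_erp, a, b, c, d)
    aug_target = mobius_augment(pseudo_depth, a, b, c, d)
    return F.l1_loss(student(aug_input), aug_target)
```

With the identity parameters (a = d = 1, b = c = 0) the warp reduces to a no-op; a real training loop would restrict the Möbius parameters to meaningful spatial transformations and combine this consistency term with supervised losses on the labeled data.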
Related papers
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation [6.832852988957967]
We propose a new depth estimation framework that utilizes unlabeled 360-degree data effectively.
Our approach uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels.
We tested our approach on benchmark datasets such as Matterport3D and Stanford2D3D, showing significant improvements in depth estimation accuracy.
arXiv Detail & Related papers (2024-06-18T17:59:31Z)
- Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors [51.36238367193988]
We tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDMs).
We present SparseSplat360, a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views.
Our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.
arXiv Detail & Related papers (2024-05-26T11:01:39Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Distortion-aware Transformer in 360° Salient Object Detection [44.74647420381127]
We propose a Transformer-based model called DATFormer to address the distortion problem.
To exploit the unique characteristics of 360° data, we present a learnable relation matrix.
Our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods.
arXiv Detail & Related papers (2023-08-07T07:28:24Z)
- SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation [53.5256153325136]
PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view.
Prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of the original 360° data.
We propose the Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS) to be more robust to 3D disturbance.
arXiv Detail & Related papers (2023-06-06T04:49:51Z)
- NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views [77.93662205673297]
In this work, we study the challenging task of lifting a single image to a 3D object.
We demonstrate the ability to generate a plausible 3D object with 360° views that correspond well with a given reference image.
We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware radiance representation.
arXiv Detail & Related papers (2022-11-29T17:59:06Z)
- View-aware Salient Object Detection for 360° Omnidirectional Image [33.43250302656753]
We construct a large-scale 360° ISOD dataset with object-level pixel-wise annotation on equirectangular projection (ERP).
Inspired by the human observation process, we propose a view-aware salient object detection method based on a Sample Adaptive View Transformer (SAVT) module.
arXiv Detail & Related papers (2022-09-27T07:44:08Z)
- SHD360: A Benchmark Dataset for Salient Human Detection in 360° Videos [26.263614207849276]
We propose SHD360, the first 360° video SHD dataset collecting various real-life daily scenes.
SHD360 contains 16,238 salient human instances with manually annotated pixel-wise ground truth.
Our proposed dataset and benchmark could serve as a good starting point for advancing human-centric research towards 360° panoramic data.
arXiv Detail & Related papers (2021-05-24T23:51:29Z)
- A Fixation-based 360° Benchmark Dataset for Salient Object Detection [21.314578493964333]
Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications.
Salient object detection (SOD) has seldom been explored in 360° images due to the lack of datasets representative of real scenes.
arXiv Detail & Related papers (2020-01-22T11:16:39Z)
- Visual Question Answering on 360° Images [96.00046925811515]
VQA 360 is a novel task of visual question answering on 360 images.
We collect the first VQA 360 dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types.
arXiv Detail & Related papers (2020-01-10T08:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.