Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV
- URL: http://arxiv.org/abs/2307.10713v1
- Date: Thu, 20 Jul 2023 09:13:32 GMT
- Title: Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV
- Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden
- Abstract summary: Self-supervised monocular depth estimation (SS-MDE) has the potential to scale to vast quantities of data.
We propose a large-scale SlowTV dataset curated from YouTube, containing an order of magnitude more data than existing automotive datasets.
We train an SS-MDE model that provides zero-shot generalization to a large collection of indoor/outdoor datasets.
- Score: 68.31957280416347
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised monocular depth estimation (SS-MDE) has the potential to
scale to vast quantities of data. Unfortunately, existing approaches limit
themselves to the automotive domain, resulting in models incapable of
generalizing to complex environments such as natural or indoor settings.
To address this, we propose a large-scale SlowTV dataset curated from
YouTube, containing an order of magnitude more data than existing automotive
datasets. SlowTV contains 1.7M images from a rich diversity of environments,
such as worldwide seasonal hiking, scenic driving and scuba diving. Using this
dataset, we train an SS-MDE model that provides zero-shot generalization to a
large collection of indoor/outdoor datasets. The resulting model outperforms
all existing SSL approaches and closes the gap on supervised SoTA, despite
using a more efficient architecture.
We additionally introduce a collection of best-practices to further maximize
performance and zero-shot generalization. This includes 1) aspect ratio
augmentation, 2) camera intrinsic estimation, 3) support frame randomization
and 4) flexible motion estimation. Code is available at
https://github.com/jspenmar/slowtv_monodepth.
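Of the four best-practices, aspect-ratio augmentation is the simplest to illustrate: varying the batch shape during training stops the network from overfitting to the fixed aspect ratio of a single dataset. The following is a minimal sketch only, assuming a PyTorch pipeline; the function name, candidate ratios, and resolutions are illustrative assumptions, not settings from the SlowTV codebase (see the linked repository for the authors' implementation):

```python
import random
import torch
import torch.nn.functional as F

def aspect_ratio_augment(images, intrinsics,
                         ratios=(4/3, 16/9, 3/2, 1.0), base_height=384):
    """Resize a batch to a randomly chosen aspect ratio.

    Illustrative only: ratios and resolution are assumptions, not the
    paper's settings. `images` is (B, 3, H, W); `intrinsics` is a
    (B, 3, 3) pinhole K matrix.
    """
    b, _, h, w = images.shape
    ratio = random.choice(ratios)
    # Keep dimensions multiples of 32 so typical encoder strides divide evenly.
    new_h = base_height
    new_w = int(round(new_h * ratio / 32) * 32)
    out = F.interpolate(images, size=(new_h, new_w),
                        mode='bilinear', align_corners=False)
    # Rescale the intrinsics consistently (ties into best-practice 2).
    k = intrinsics.clone()
    k[:, 0] *= new_w / w  # fx, cx scale with width
    k[:, 1] *= new_h / h  # fy, cy scale with height
    return out, k

# Usage: a KITTI-like batch resized to a random training shape.
imgs = torch.rand(4, 3, 376, 1241)
K = torch.eye(3).expand(4, 3, 3).clone()
aug_imgs, aug_K = aspect_ratio_augment(imgs, K)
```

Whether the new shape comes from resizing (as here) or cropping, the camera intrinsics must track the change, which is why the sketch returns an updated K alongside the images.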
Related papers
- SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604]
Current approaches face challenges in maintaining consistent accuracy across diverse scenes.
These methods rely on extensive datasets comprising millions, if not tens of millions, of samples for training.
This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z)
- Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV [50.616892315086574]
This paper proposes two novel datasets: SlowTV and CribsTV.
These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames.
We leverage these datasets to tackle the challenging task of zero-shot generalization.
arXiv Detail & Related papers (2024-03-03T17:29:03Z)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [87.61900472933523]
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation.
We scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (a sketch of this pseudo-labeling pattern appears after this list).
We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos.
arXiv Detail & Related papers (2024-01-19T18:59:52Z)
- Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis.
With this adaptable architecture, Seer can generate high-fidelity, coherent, and instruction-aligned video frames.
arXiv Detail & Related papers (2023-03-27T03:12:24Z)
- SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving [94.11868795445798]
We release a large-scale object detection benchmark for autonomous driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories.
To improve diversity, one frame is collected every ten seconds, across 32 different cities and under varied weather conditions, time periods, and location scenes.
We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models.
arXiv Detail & Related papers (2021-06-21T13:55:57Z)
- Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road-attribute prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
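The Depth Anything entry above refers to a data engine that annotates unlabeled images automatically. A common way to realize such an engine is pseudo-labeling: a teacher model trained on labeled data produces depth targets for unlabeled images, which then supervise a student. This is a minimal sketch of that general pattern, assuming PyTorch; it is not the paper's actual pipeline, and the function names and loss choice below are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate_unlabeled(teacher, unlabeled_loader, device='cuda'):
    """Pseudo-label unlabeled images with a frozen teacher depth model.

    `teacher` is any monocular depth network mapping (B, 3, H, W) images
    to (B, 1, H, W) depth maps (a hypothetical stand-in here).
    """
    teacher.eval()
    pseudo_pairs = []
    for images in unlabeled_loader:
        images = images.to(device)
        depth = teacher(images)  # predicted depth used as the training target
        pseudo_pairs.append((images.cpu(), depth.cpu()))
    return pseudo_pairs

def train_student_step(student, optimizer, images, pseudo_depth, device='cuda'):
    """One supervised step of a student model on teacher-generated targets."""
    student.train()
    images, pseudo_depth = images.to(device), pseudo_depth.to(device)
    pred = student(images)
    loss = F.l1_loss(pred, pseudo_depth)  # simple choice; robust losses also work
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```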