Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV &
CribsTV
- URL: http://arxiv.org/abs/2403.01569v1
- Date: Sun, 3 Mar 2024 17:29:03 GMT
- Title: Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV &
CribsTV
- Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden
- Abstract summary: This paper proposes two novel datasets: SlowTV and CribsTV.
These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames.
We leverage these datasets to tackle the challenging task of zero-shot generalization.
- Score: 50.616892315086574
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning is the key to unlocking generic computer vision
systems. By eliminating the reliance on ground-truth annotations, it allows
scaling to much larger data quantities. Unfortunately, self-supervised
monocular depth estimation (SS-MDE) has been limited by the absence of diverse
training data. Existing datasets have focused exclusively on urban driving in
densely populated cities, resulting in models that fail to generalize beyond
this domain.
To address these limitations, this paper proposes two novel datasets: SlowTV
and CribsTV. These are large-scale datasets curated from publicly available
YouTube videos, containing a total of 2M training frames. They offer an
incredibly diverse set of environments, ranging from snowy forests to coastal
roads, luxury mansions and even underwater coral reefs. We leverage these
datasets to tackle the challenging task of zero-shot generalization,
outperforming every existing SS-MDE approach and even some state-of-the-art
supervised methods.
The generalization capabilities of our models are further enhanced by a range
of components and contributions: 1) learning the camera intrinsics, 2) a
stronger augmentation regime targeting aspect ratio changes, 3) support frame
randomization, 4) flexible motion estimation, 5) a modern transformer-based
architecture. We demonstrate the effectiveness of each component in extensive
ablation experiments. To facilitate the development of future research, we make
the datasets, code and pretrained models available to the public at
https://github.com/jspenmar/slowtv_monodepth.
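For context, the sketch below illustrates the view-synthesis objective at the heart of SS-MDE: predicted depth and relative pose are used to warp a support frame into the target view, and the photometric error between the warped and real target images supervises the network without any ground-truth depth. A toy head for regressing the camera intrinsics (component 1 above) is included, since learned intrinsics are what allow uncalibrated YouTube footage to be used. This is a minimal PyTorch sketch under our own assumptions; names, shapes, and parametrizations are illustrative and are not taken from the released slowtv_monodepth code.

```python
import torch
import torch.nn.functional as F


def warp_support_frame(img_support, depth, K, K_inv, T):
    """Warp a support frame into the target view using predicted depth
    and relative pose (illustrative sketch, not the paper's code).

    img_support: (B, 3, H, W) support (source) frame
    depth:       (B, 1, H, W) predicted target-view depth
    K, K_inv:    (B, 3, 3) camera intrinsics and their inverse
    T:           (B, 4, 4) predicted relative pose, target -> support
    """
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=depth.device, dtype=depth.dtype),
        torch.arange(w, device=depth.device, dtype=depth.dtype),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(1, 3, -1)

    # Back-project every pixel to 3-D using the predicted depth.
    pts = (K_inv @ pix) * depth.reshape(b, 1, -1)              # (B, 3, N)
    pts = torch.cat([pts, torch.ones(b, 1, h * w, device=depth.device)], 1)

    # Move the points into the support camera and re-project them.
    cam = (T @ pts)[:, :3]
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)             # (B, 2, N)

    # Normalize to [-1, 1] and sample the support frame.
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    return F.grid_sample(img_support, grid.reshape(b, h, w, 2),
                         padding_mode="border", align_corners=True)


def photometric_loss(img_target, img_warped):
    """Per-pixel photometric error; real pipelines blend L1 with SSIM
    and take the per-pixel minimum over several support frames."""
    return (img_target - img_warped).abs().mean()


class IntrinsicsHead(torch.nn.Module):
    """Toy head regressing normalized fx, fy, cx, cy from pooled
    pose-network features; this parametrization is an assumption."""

    def __init__(self, feat_dim):
        super().__init__()
        self.fc = torch.nn.Linear(feat_dim, 4)

    def forward(self, feats, h, w):
        fx, fy, cx, cy = torch.sigmoid(self.fc(feats)).unbind(dim=-1)
        K = torch.zeros(feats.shape[0], 3, 3, device=feats.device)
        K[:, 0, 0] = fx * w
        K[:, 1, 1] = fy * h
        K[:, 0, 2] = cx * w
        K[:, 1, 2] = cy * h
        K[:, 2, 2] = 1.0
        return K
```

Minimizing the photometric error jointly over the depth network, pose network, and intrinsics head yields depth supervision from raw video alone, which is what makes training on uncalibrated collections like SlowTV and CribsTV possible.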
Related papers
- MegaScenes: Scene-Level View Synthesis at Scale [69.21293001231993] (arXiv: 2024-06-17)
Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications.
We create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world.
We analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency.
- SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604] (arXiv: 2024-03-13)
Current approaches face challenges in maintaining consistent accuracy across diverse scenes.
These methods rely on extensive datasets comprising millions, if not tens of millions, of training samples.
This paper presents SM4Depth, a model that works seamlessly across both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z) - Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [87.61900472933523]
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation.
We scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data.
We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos.
arXiv Detail & Related papers (2024-01-19T18:59:52Z) - Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation [20.230238670888454]
We introduce Marigold, a method for affine-invariant monocular depth estimation.
It can be fine-tuned in a couple of days on a single GPU using only synthetic training data.
It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases.
arXiv Detail & Related papers (2023-12-04T18:59:13Z) - Towards Better Data Exploitation in Self-Supervised Monocular Depth
Estimation [14.262669370264994]
In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets.
Specifically, the original image and the two generated augmented images are fed into the training pipeline simultaneously, and we leverage them to conduct self-distillation.
Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth.
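As a rough illustration of the two augmentations named above, the sketch below shows one plausible reading of Resizing-Cropping (zoom in, then crop back to the original resolution) and Splitting-Permuting (shuffling image regions). All names here are our assumptions, and the cited paper's exact recipes may differ.

```python
import torch
import torch.nn.functional as F


def resize_crop(img, scale=1.3):
    """Upscale, then crop a random window back to the original size,
    simulating a zoom / focal-length change."""
    b, c, h, w = img.shape
    big = F.interpolate(img, scale_factor=scale, mode="bilinear",
                        align_corners=False)
    top = torch.randint(0, big.shape[2] - h + 1, (1,)).item()
    left = torch.randint(0, big.shape[3] - w + 1, (1,)).item()
    return big[:, :, top:top + h, left:left + w]


def split_permute(img):
    """Split each image into four quadrants and shuffle them, breaking
    global layout while keeping local structure (assumes even H, W)."""
    b, c, h, w = img.shape
    h2, w2 = h // 2, w // 2
    quads = [img[:, :, :h2, :w2], img[:, :, :h2, w2:],
             img[:, :, h2:, :w2], img[:, :, h2:, w2:]]
    order = torch.randperm(4).tolist()
    top = torch.cat([quads[order[0]], quads[order[1]]], dim=3)
    bottom = torch.cat([quads[order[2]], quads[order[3]]], dim=3)
    return torch.cat([top, bottom], dim=2)
```

Per the summary above, the original image and both augmented views are then passed through the network together, with predictions on the original view serving as self-distillation targets for the augmented ones.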
- Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV [68.31957280416347] (arXiv: 2023-07-20)
Self-supervised monocular depth estimation (SS-MDE) has the potential to scale to vast quantities of data.
We propose a large-scale SlowTV dataset curated from YouTube, containing an order of magnitude more data than existing automotive datasets.
We train an SS-MDE model that provides zero-shot generalization to a large collection of indoor/outdoor datasets.
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801] (arXiv: 2023-05-29)
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences arising from its use.