CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion
- URL: http://arxiv.org/abs/2210.10716v1
- Date: Wed, 19 Oct 2022 16:50:36 GMT
- Title: CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion
- Authors: Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier,
Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela
Csurka, Jérôme Revaud
- Abstract summary: Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
- Score: 20.121597331207276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Image Modeling (MIM) has recently been established as a potent
pre-training paradigm. A pretext task is constructed by masking patches in an
input image, and this masked content is then predicted by a neural network
using visible patches as sole input. This pre-training leads to
state-of-the-art performance when finetuned for high-level semantic tasks, e.g.
image classification and object detection. In this paper we instead seek to
learn representations that transfer well to a wide variety of 3D vision and
lower-level geometric downstream tasks, such as depth prediction or optical
flow estimation. Inspired by MIM, we propose an unsupervised representation
learning task trained from pairs of images showing the same scene from
different viewpoints. More precisely, we propose the pretext task of cross-view
completion where the first input image is partially masked, and this masked
content has to be reconstructed from the visible content and the second image.
In single-view MIM, the masked content often cannot be inferred precisely from
the visible portion only, so the model learns to act as a prior influenced by
high-level semantics. In contrast, this ambiguity can be resolved with
cross-view completion from the second unmasked image, on the condition that the
model is able to understand the spatial relationship between the two images.
Our experiments show that our pretext task leads to significantly improved
performance for monocular 3D vision downstream tasks such as depth estimation.
In addition, our model can be directly applied to binocular downstream tasks
like optical flow or relative camera pose estimation, for which we obtain
competitive results without bells and whistles, i.e., using a generic
architecture without any task-specific design.
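The data preparation behind the cross-view completion pretext task can be sketched in a few lines. The following is a minimal, illustrative NumPy mock-up, not the authors' actual ViT-based implementation: it patchifies two views of the same scene, masks a large fraction of the first view's patches, and shows how the visible first-view patches plus the full second view form the model input while the masked patches become the reconstruction target. Function names, the 16-pixel patch size, and the 0.9 masking ratio are illustrative assumptions.

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    returned as a (num_patches, p*p*C) array."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

def cross_view_completion_inputs(view1, view2, mask_ratio=0.9, p=16, seed=0):
    """Build the pretext-task tensors: visible patches of view 1,
    all patches of view 2, and the masked view-1 patches to reconstruct."""
    rng = np.random.default_rng(seed)
    x1, x2 = patchify(view1, p), patchify(view2, p)
    n = x1.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]
    # The network sees the visible view-1 patches plus the entire view 2;
    # the reconstruction loss is computed only on the masked view-1 patches.
    return x1[visible_idx], x2, x1[masked_idx], masked_idx

view1 = np.random.rand(224, 224, 3)   # partially masked input view
view2 = np.random.rand(224, 224, 3)   # unmasked second view of the same scene
visible, context, target, idx = cross_view_completion_inputs(view1, view2)
print(visible.shape, context.shape, target.shape)
# 196 patches per view; with a 0.9 ratio, 176 are masked and 20 stay visible
```

The key difference from single-view MIM is visible in the return value: the decoder conditions on `context` (the whole second view), so a model that learns the spatial correspondence between the views can resolve ambiguity that visible view-1 patches alone cannot.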
Related papers
- One Diffusion to Generate Them All [54.82732533013014]
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding.
It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps.
OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
arXiv Detail & Related papers (2024-11-25T12:11:05Z)
- Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- Siamese Image Modeling for Self-Supervised Vision Representation Learning [73.78790119050056]
Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks.
Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM).
We propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view.
arXiv Detail & Related papers (2022-06-02T17:59:58Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM), termed Geminated Autoencoder (Ge$2$-AE) for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to learn by recovering missing contents.
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
- LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
arXiv Detail & Related papers (2022-03-15T13:22:57Z)
- Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
We propose a learning-based approach to infer the 3D shape and pose of an object from a single image.
We first infer a volumetric representation in a canonical frame, along with the camera pose.
The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
arXiv Detail & Related papers (2021-02-11T18:57:10Z)
- Object Detection on Single Monocular Images through Canonical Correlation Analysis [3.4722706398428493]
We retrieve 3-D object information from single monocular images without using extra 3-D data such as point clouds or depth images.
We propose a two-dimensional CCA framework to fuse monocular images and corresponding predicted depth images for basic computer vision tasks.
arXiv Detail & Related papers (2020-02-13T05:03:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.