CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion
- URL: http://arxiv.org/abs/2210.10716v1
- Date: Wed, 19 Oct 2022 16:50:36 GMT
- Title: CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion
- Authors: Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier,
Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela
Csurka, Jérôme Revaud
- Abstract summary: Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
- Score: 20.121597331207276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Image Modeling (MIM) has recently been established as a potent
pre-training paradigm. A pretext task is constructed by masking patches in an
input image, and this masked content is then predicted by a neural network
using visible patches as sole input. This pre-training leads to
state-of-the-art performance when finetuned for high-level semantic tasks, e.g.
image classification and object detection. In this paper we instead seek to
learn representations that transfer well to a wide variety of 3D vision and
lower-level geometric downstream tasks, such as depth prediction or optical
flow estimation. Inspired by MIM, we propose an unsupervised representation
learning task trained from pairs of images showing the same scene from
different viewpoints. More precisely, we propose the pretext task of cross-view
completion where the first input image is partially masked, and this masked
content has to be reconstructed from the visible content and the second image.
In single-view MIM, the masked content often cannot be inferred precisely from
the visible portion only, so the model learns to act as a prior influenced by
high-level semantics. In contrast, this ambiguity can be resolved with
cross-view completion from the second unmasked image, on the condition that the
model is able to understand the spatial relationship between the two images.
Our experiments show that our pretext task leads to significantly improved
performance for monocular 3D vision downstream tasks such as depth estimation.
In addition, our model can be directly applied to binocular downstream tasks
like optical flow or relative camera pose estimation, for which we obtain
competitive results without bells and whistles, i.e., using a generic
architecture without any task-specific design.
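The data preparation behind the cross-view completion pretext task can be sketched in a few lines. The following is a minimal, illustrative NumPy mock-up, not the authors' actual ViT-based implementation: it patchifies two views of the same scene, masks a large fraction of the first view's patches, and shows how the visible first-view patches plus the full second view form the model input while the masked patches become the reconstruction target. Function names, the 16-pixel patch size, and the 0.9 masking ratio are illustrative assumptions.

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    returned as a (num_patches, p*p*C) array."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

def cross_view_completion_inputs(view1, view2, mask_ratio=0.9, p=16, seed=0):
    """Build the pretext-task tensors: visible patches of view 1,
    all patches of view 2, and the masked view-1 patches to reconstruct."""
    rng = np.random.default_rng(seed)
    x1, x2 = patchify(view1, p), patchify(view2, p)
    n = x1.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]
    # The network sees the visible view-1 patches plus the entire view 2;
    # the reconstruction loss is computed only on the masked view-1 patches.
    return x1[visible_idx], x2, x1[masked_idx], masked_idx

view1 = np.random.rand(224, 224, 3)   # partially masked input view
view2 = np.random.rand(224, 224, 3)   # unmasked second view of the same scene
visible, context, target, idx = cross_view_completion_inputs(view1, view2)
print(visible.shape, context.shape, target.shape)
# 196 patches per view; with a 0.9 ratio, 176 are masked and 20 stay visible
```

The key difference from single-view MIM is visible in the return value: the decoder conditions on `context` (the whole second view), so a model that learns the spatial correspondence between the views can resolve ambiguity that visible view-1 patches alone cannot.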
Related papers
- One Diffusion to Generate Them All [54.82732533013014]
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding.
It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps.
OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
arXiv Detail & Related papers (2024-11-25T12:11:05Z)
- Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- Siamese Image Modeling for Self-Supervised Vision Representation Learning [73.78790119050056]
Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks.
Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM).
We propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view.
arXiv Detail & Related papers (2022-06-02T17:59:58Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM), termed Geminated Autoencoder (Ge$2$-AE) for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to learn by recovering missing contents.
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
- LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
arXiv Detail & Related papers (2022-03-15T13:22:57Z)
- Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
We propose a learning-based approach to infer the 3D shape and pose of an object from a single image.
We first infer a volumetric representation in a canonical frame, along with the camera pose.
The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
arXiv Detail & Related papers (2021-02-11T18:57:10Z)
- Object Detection on Single Monocular Images through Canonical Correlation Analysis [3.4722706398428493]
We retrieve 3-D object information from single monocular images without using extra 3-D data such as point clouds or depth images.
We propose a two-dimensional CCA framework to fuse monocular images and corresponding predicted depth images for basic computer vision tasks.
arXiv Detail & Related papers (2020-02-13T05:03:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.