CroCo v2: Improved Cross-view Completion Pre-training for Stereo
Matching and Optical Flow
- URL: http://arxiv.org/abs/2211.10408v3
- Date: Fri, 18 Aug 2023 15:06:20 GMT
- Authors: Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon,
Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris
Chidlovskii, Jérôme Revaud
- Abstract summary: Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite impressive performance for high-level downstream tasks,
self-supervised pre-training methods have not yet fully delivered on dense
geometric vision tasks such as stereo matching or optical flow. The application
of self-supervised concepts, such as instance discrimination or masked image
modeling, to geometric tasks is an active area of research. In this work, we
build on the recent cross-view completion framework, a variation of masked
image modeling that leverages a second view from the same scene which makes it
well suited for binocular downstream tasks. The applicability of this concept
has so far been limited in at least two ways: (a) by the difficulty of
collecting real-world image pairs -- in practice only synthetic data have been
used -- and (b) by the lack of generalization of vanilla transformers to dense
downstream tasks for which relative position is more meaningful than absolute
position. We explore three avenues of improvement. First, we introduce a method
to collect suitable real-world image pairs at large scale. Second, we
experiment with relative positional embeddings and show that they enable vision
transformers to perform substantially better. Third, we scale up vision
transformer based cross-completion architectures, which is made possible by the
use of large amounts of data. With these improvements, we show for the first
time that state-of-the-art results on stereo matching and optical flow can be
reached without using any classical task-specific techniques like correlation
volume, iterative estimation, image warping or multi-scale reasoning, thus
paving the way towards universal vision models.
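To make the pretext task concrete, here is a minimal PyTorch sketch of cross-view completion: a heavily masked first view is reconstructed by a decoder that cross-attends to an unmasked second view of the same scene. All module names, dimensions, and the mask ratio are illustrative assumptions, not the authors' released code (positional embeddings and patchification are omitted for brevity).

```python
# Illustrative sketch of cross-view completion (hypothetical, simplified).
import torch
import torch.nn as nn

class CrossViewCompletion(nn.Module):
    def __init__(self, dim=768, depth=4, heads=12, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)      # shared across views
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, depth)      # cross-attends to view 2
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)                       # regresses masked patch targets

    def forward(self, tokens1, tokens2):
        """tokens1, tokens2: (B, N, dim) patch embeddings of the two views."""
        B, N, D = tokens1.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N).argsort(dim=1)                 # random per-image mask
        vis_idx, msk_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(tokens1, 1, vis_idx[..., None].expand(-1, -1, D))
        ctx2 = self.encoder(tokens2)                          # second view, fully visible
        enc1 = self.encoder(visible)                          # visible view-1 patches only
        queries = torch.cat([enc1, self.mask_token.expand(B, N - keep, D)], dim=1)
        out = self.decoder(queries, ctx2)                     # completion via cross-attention
        target = torch.gather(tokens1, 1, msk_idx[..., None].expand(-1, -1, D))
        return (self.head(out[:, keep:]) - target).pow(2).mean()  # MSE on masked patches
```

The key difference from single-image masked image modeling is the `ctx2` memory: the decoder can exploit geometric correspondence with the second view rather than relying on single-image context alone, which is what makes the pretext task relevant for stereo and flow.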
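The relative positional embeddings mentioned above can be illustrated with a rotary scheme (RoPE), in which query/key channels are rotated by position-dependent angles so that attention scores depend only on relative offsets. The following is a 1D sketch under an assumed split-half channel pairing; the actual model operates on 2D patch grids.

```python
# Illustrative 1D rotary position embedding (RoPE) sketch; the helper name
# and channel layout are assumptions for exposition.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (B, N, D) by angles proportional to token
    position, so query-key dot products depend only on relative offset."""
    B, N, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(N, dtype=torch.float32)[:, None] * freqs     # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]                              # paired channels
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# In attention, queries and keys are rotated before the dot product:
#   scores = rope(q) @ rope(k).transpose(-2, -1) / D ** 0.5
```

Because the position information is injected inside attention rather than added once to the input, it behaves consistently under shifts of the patch grid, which is in line with the abstract's point that relative position is more meaningful than absolute position for dense tasks.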
Related papers
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers
Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Unifying Flow, Stereo and Depth Estimation
We present a unified formulation and model for three motion and 3D perception tasks.
We formulate all three tasks as a unified dense correspondence matching problem.
Our model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks.
arXiv Detail & Related papers (2022-11-10T18:59:54Z)
- CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
arXiv Detail & Related papers (2022-10-19T16:50:36Z)
- A Visual Navigation Perspective for Category-Level Object Pose Estimation
This paper studies category-level object pose estimation based on a single monocular image.
Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis.
arXiv Detail & Related papers (2022-03-25T10:57:37Z)
- CoSformer: Detecting Co-Salient Object with Transformers
Co-Salient Object Detection (CoSOD) aims at simulating the human visual system to discover the common and salient objects from a group of relevant images.
We propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images.
arXiv Detail & Related papers (2021-04-30T02:39:12Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
- Two-shot Spatially-varying BRDF and Shape Estimation
We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
We create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials.
Experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
arXiv Detail & Related papers (2020-04-01T12:56:13Z)