Playing to Vision Foundation Model's Strengths in Stereo Matching
- URL: http://arxiv.org/abs/2404.06261v1
- Date: Tue, 9 Apr 2024 12:34:28 GMT
- Title: Playing to Vision Foundation Model's Strengths in Stereo Matching
- Authors: Chuang-Wei Liu, Qijun Chen, Rui Fan
- Abstract summary: This study serves as the first exploration of a viable approach for adapting vision foundation models (VFMs) to stereo matching.
Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention.
ViTAStereo outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels.
- Score: 13.887661472501618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those developed based on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, their performance often falls short in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
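To make the adapter-plus-back-end structure described in the abstract concrete, below is a minimal PyTorch sketch of how a ViT adapter in the spirit of ViTAS could feed a cost volume-based stereo back-end. This is not the authors' implementation: the module internals, channel width, token grid size, and disparity range are illustrative assumptions, and the spatial-differentiation and cross-attention classes are hypothetical stand-ins for the modules named in the abstract (patch attention fusion is omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialDifferentiation(nn.Module):
    """Hypothetical stand-in: turn ViT patch tokens into a multi-scale feature pyramid."""

    def __init__(self, dim=384, strides=(4, 8, 16)):
        super().__init__()
        self.strides = strides
        self.proj = nn.ModuleList([nn.Conv2d(dim, dim, kernel_size=1) for _ in strides])

    def forward(self, tokens, grid_hw):
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)  # tokens -> 2D map (stride 16)
        # Coarse-to-fine maps by simple resizing; the real module is certainly richer.
        return [proj(F.interpolate(x, scale_factor=16 / s, mode="bilinear", align_corners=False))
                for proj, s in zip(self.proj, self.strides)]


class StereoCrossAttention(nn.Module):
    """Hypothetical stand-in: aggregate context from the other view via cross-attention."""

    def __init__(self, dim=384, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, this_view, other_view):
        fused, _ = self.attn(query=this_view, key=other_view, value=other_view)
        return this_view + fused  # residual update of the current view's tokens


def build_cost_volume(feat_l, feat_r, max_disp=16):
    """Plain correlation cost volume, the conventional input to a stereo back-end."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume


# Toy usage with random tensors: a 24x24 token grid from a patch-16 ViT, dim 384.
sd, xattn = SpatialDifferentiation(), StereoCrossAttention()
left_tok, right_tok = torch.randn(2, 1, 24 * 24, 384).unbind(0)
left_pyr = sd(xattn(left_tok, right_tok), (24, 24))    # multi-scale left features
right_pyr = sd(xattn(right_tok, left_tok), (24, 24))   # multi-scale right features
cost = build_cost_volume(left_pyr[-1], right_pyr[-1])  # shape (1, 16, 24, 24)
```

In an actual pipeline the random tokens would be replaced by features from a frozen, self-supervised ViT, and the cost volume would be passed to a disparity estimation back-end.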
Related papers
- FoundationStereo: Zero-Shot Stereo Matching [50.79202911274819]
FoundationStereo is a foundation model for stereo depth estimation.
We first construct a large-scale (1M stereo pairs) synthetic training dataset.
We then design a number of network architecture components to enhance scalability.
arXiv Detail & Related papers (2025-01-17T01:01:44Z)
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
- All-in-One: Transferring Vision Foundation Models into Stereo Matching [13.781452399651887]
AIO-Stereo can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model.
We show that AIO-Stereo achieves state-of-the-art performance on multiple datasets and ranks 1st on the Middlebury dataset.
arXiv Detail & Related papers (2024-12-13T06:59:17Z)
- Global-Local Progressive Integration Network for Blind Image Quality Assessment [6.095342999639137]
Vision transformers (ViTs) excel in computer vision for modeling long-term dependencies, yet face two key challenges for image quality assessment (IQA).
We propose a Global-Local progressive INTegration network for IQA, called GlintIQA, to address these issues through three key components.
arXiv Detail & Related papers (2024-08-07T16:34:32Z)
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
- Multi-scale Alternated Attention Transformer for Generalized Stereo Matching [7.493797166406228]
We present a simple but highly effective network called the Alternated Attention U-shaped Transformer (AAUformer) to balance the influence of the epipolar line in dual and single views.
Compared with other models, ours incorporates several key design choices.
We performed a series of both comparative studies and ablation studies on several mainstream stereo matching datasets.
arXiv Detail & Related papers (2023-08-06T08:22:39Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection [39.37861288287621]
A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A random compact convolutional stem supplants the pre-trained large kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP.
arXiv Detail & Related papers (2022-04-06T17:59:04Z)
- Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective [65.37571681370096]
We propose a simple pixel-wise contrastive learning scheme across the two viewpoints.
A stereo selective whitening loss is introduced to better preserve the stereo feature consistency across domains.
Our method achieves superior performance over several state-of-the-art networks.
arXiv Detail & Related papers (2022-03-21T11:21:41Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.