Playing to Vision Foundation Model's Strengths in Stereo Matching
- URL: http://arxiv.org/abs/2404.06261v1
- Date: Tue, 9 Apr 2024 12:34:28 GMT
- Title: Playing to Vision Foundation Model's Strengths in Stereo Matching
- Authors: Chuang-Wei Liu, Qijun Chen, Rui Fan
- Abstract summary: This study serves as the first exploration of a viable approach for adapting vision foundation models (VFMs) to stereo matching.
Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention.
ViTAStereo outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels.
- Score: 13.887661472501618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those developed based on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, their performance often falls short in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
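To make the adapter-plus-back-end structure described in the abstract concrete, below is a minimal PyTorch sketch of how a ViT adapter in the spirit of ViTAS could feed a cost volume-based stereo back-end. This is not the authors' implementation: the module internals, channel width, token grid size, and disparity range are illustrative assumptions, and the spatial-differentiation and cross-attention classes are hypothetical stand-ins for the modules named in the abstract (patch attention fusion is omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialDifferentiation(nn.Module):
    """Hypothetical stand-in: turn ViT patch tokens into a multi-scale feature pyramid."""

    def __init__(self, dim=384, strides=(4, 8, 16)):
        super().__init__()
        self.strides = strides
        self.proj = nn.ModuleList([nn.Conv2d(dim, dim, kernel_size=1) for _ in strides])

    def forward(self, tokens, grid_hw):
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)  # tokens -> 2D map (stride 16)
        # Coarse-to-fine maps by simple resizing; the real module is certainly richer.
        return [proj(F.interpolate(x, scale_factor=16 / s, mode="bilinear", align_corners=False))
                for proj, s in zip(self.proj, self.strides)]


class StereoCrossAttention(nn.Module):
    """Hypothetical stand-in: aggregate context from the other view via cross-attention."""

    def __init__(self, dim=384, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, this_view, other_view):
        fused, _ = self.attn(query=this_view, key=other_view, value=other_view)
        return this_view + fused  # residual update of the current view's tokens


def build_cost_volume(feat_l, feat_r, max_disp=16):
    """Plain correlation cost volume, the conventional input to a stereo back-end."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume


# Toy usage with random tensors: a 24x24 token grid from a patch-16 ViT, dim 384.
sd, xattn = SpatialDifferentiation(), StereoCrossAttention()
left_tok, right_tok = torch.randn(2, 1, 24 * 24, 384).unbind(0)
left_pyr = sd(xattn(left_tok, right_tok), (24, 24))    # multi-scale left features
right_pyr = sd(xattn(right_tok, left_tok), (24, 24))   # multi-scale right features
cost = build_cost_volume(left_pyr[-1], right_pyr[-1])  # shape (1, 16, 24, 24)
```

In an actual pipeline the random tokens would be replaced by features from a frozen, self-supervised ViT, and the cost volume would be passed to a disparity estimation back-end.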
Related papers
- FoundationStereo: Zero-Shot Stereo Matching [50.79202911274819]
FoundationStereo is a foundation model for stereo depth estimation.
We first construct a large-scale (1M stereo pairs) synthetic training dataset.
We then design a number of network architecture components to enhance scalability.
arXiv Detail & Related papers (2025-01-17T01:01:44Z)
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
- All-in-One: Transferring Vision Foundation Models into Stereo Matching [13.781452399651887]
AIO-Stereo can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model.
We show that AIO-Stereo achieves state-of-the-art performance on multiple datasets and ranks 1st on the Middlebury dataset.
arXiv Detail & Related papers (2024-12-13T06:59:17Z)
- Global-Local Progressive Integration Network for Blind Image Quality Assessment [6.095342999639137]
Vision transformers (ViTs) excel in computer vision for modeling long-term dependencies, yet face two key challenges for image quality assessment (IQA).
We propose a Global-Local progressive INTegration network for IQA, called GlintIQA, to address these issues through three key components.
arXiv Detail & Related papers (2024-08-07T16:34:32Z)
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
- Multi-scale Alternated Attention Transformer for Generalized Stereo Matching [7.493797166406228]
We present a simple but highly effective network called the Alternated Attention U-shaped Transformer (AAUformer) to balance the influence of the epipolar line in dual and single views.
Compared with other models, ours incorporates several key design choices.
We performed a series of both comparative studies and ablation studies on several mainstream stereo matching datasets.
arXiv Detail & Related papers (2023-08-06T08:22:39Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection [39.37861288287621]
A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A random compact convolutional stem supplants the pre-trained large kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP.
arXiv Detail & Related papers (2022-04-06T17:59:04Z)
- Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective [65.37571681370096]
We propose a simple pixel-wise contrastive learning scheme across the two viewpoints.
A stereo selective whitening loss is introduced to better preserve the stereo feature consistency across domains.
Our method achieves superior performance over several state-of-the-art networks.
arXiv Detail & Related papers (2022-03-21T11:21:41Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.