MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
- URL: http://arxiv.org/abs/2409.02846v1
- Date: Wed, 4 Sep 2024 16:17:45 GMT
- Title: MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
- Authors: Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min,
- Abstract summary: Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task.
We propose Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training Transformer-based stereo model.
- Score: 18.02254687807291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual challenge of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. State-of-the-arts performance is achieved with the proposed method on several stereo matching such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide the attention distance measurement.
Related papers
- UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching [18.02254687807291]
UniTT-Stereo is a method to maximize the potential of Transformer-based stereo architectures.
State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets.
arXiv Detail & Related papers (2024-09-04T09:02:01Z) - StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models [2.9260206957981167]
We introduce StereoDiffusion, a method that is trainning free, remarkably straightforward to use, and seamlessly integrates into the original Stable Diffusion model.
Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs.
Our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
arXiv Detail & Related papers (2024-03-08T00:30:25Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Unifying Flow, Stereo and Depth Estimation [121.54066319299261]
We present a unified formulation and model for three motion and 3D perception tasks.
We formulate all three tasks as a unified dense correspondence matching problem.
Our model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks.
arXiv Detail & Related papers (2022-11-10T18:59:54Z) - Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning(SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z) - Active-Passive SimStereo -- Benchmarking the Cross-Generalization
Capabilities of Deep Learning-based Stereo Methods [26.662129158141763]
Self-similar or bland regions can make it difficult to match patches between two images.
Active stereo-based methods mitigate this problem by projecting a pseudo-random pattern on the scene.
If this pattern acts as a form of adversarial noise, it could negatively impact the performance of deep learning-based methods.
arXiv Detail & Related papers (2022-09-17T10:30:32Z) - PointFix: Learning to Fix Domain Bias for Robust Online Stereo
Adaptation [67.41325356479229]
We propose to incorporate an auxiliary point-selective network into a meta-learning framework, called PointFix.
In a nutshell, our auxiliary network learns to fix local variants intensively by effectively back-propagating local information through the meta-gradient.
This network is model-agnostic, so can be used in any kind of architectures in a plug-and-play manner.
arXiv Detail & Related papers (2022-07-27T07:48:29Z) - SaiNet: Stereo aware inpainting behind objects with generative networks [21.35917056958527]
We present an end-to-end network for stereo-consistent image inpainting with the objective of inpainting large missing regions behind objects.
The proposed model consists of an edge-guided UNet-like network using Partial Convolutions.
arXiv Detail & Related papers (2022-05-14T09:07:30Z) - SMD-Nets: Stereo Mixture Density Networks [68.56947049719936]
We propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures.
Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities.
We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets.
arXiv Detail & Related papers (2021-04-08T16:15:46Z) - Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations [71.00754846434744]
We show that imperceptible additive perturbations can significantly alter the disparity map.
We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust.
arXiv Detail & Related papers (2020-09-21T19:20:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.