ChiTransformer: Towards Reliable Stereo from Cues
- URL: http://arxiv.org/abs/2203.04554v4
- Date: Wed, 1 Nov 2023 03:53:10 GMT
- Title: ChiTransformer: Towards Reliable Stereo from Cues
- Authors: Qing Su, Shihao Ji
- Abstract summary: Current stereo matching techniques are challenged by a restricted search space, occluded regions, and sheer size.
We present an optic-chiasm-inspired self-supervised binocular depth estimation method.
The ChiTransformer architecture improves over state-of-the-art self-supervised stereo approaches by 11%.
- Score: 10.756828396434033
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current stereo matching techniques are challenged by a restricted
search space, occluded regions, and sheer size. While single image depth estimation is
spared from these challenges and can achieve satisfactory results with the
extracted monocular cues, the lack of stereoscopic relationship renders the
monocular prediction less reliable on its own, especially in highly dynamic or
cluttered environments. To address these issues in both scenarios, we present
an optic-chiasm-inspired self-supervised binocular depth estimation method,
wherein a vision transformer (ViT) with gated positional cross-attention (GPCA)
layers is designed to enable feature-sensitive pattern retrieval between views
while retaining the extensive context information aggregated through
self-attentions. Monocular cues from a single view are thereafter conditionally
rectified by a blending layer with the retrieved pattern pairs. This crossover
design is biologically analogous to the optic-chiasma structure in the human
visual system and hence the name, ChiTransformer. Our experiments show that
this architecture improves over state-of-the-art self-supervised stereo
approaches by 11%, and can be used on both rectilinear
and non-rectilinear (e.g., fisheye) images. The project is available at
https://github.com/ISL-CV/ChiTransformer.
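To make the mechanism concrete, below is a minimal PyTorch sketch of a gated positional cross-attention layer in the spirit of the abstract's description: a per-head learned gate blends content-based cross-view attention with a learned relative-position attention map. The class name, the rel_pos_bias parameter, and the exact gating form are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPositionalCrossAttention(nn.Module):
    """Hypothetical GPCA sketch: queries come from the master view,
    keys/values from the reference view; a per-head sigmoid gate blends
    content attention with a learned positional attention map."""

    def __init__(self, dim, num_heads=8, num_tokens=576):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learned relative-position attention logits (assumed form).
        self.rel_pos_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))
        # Per-head gate trading off content vs. positional attention.
        self.gate = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x_master, x_reference):
        B, N, C = x_master.shape  # N must equal num_tokens in this sketch
        q = self.q(x_master).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x_reference).reshape(B, N, 2, self.num_heads, self.head_dim) \
                                   .permute(2, 0, 3, 1, 4)
        content = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        positional = F.softmax(self.rel_pos_bias, dim=-1).unsqueeze(0)
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        attn = (1.0 - g) * content + g * positional  # gated blend of the two maps
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In the paper's pipeline, the pattern pairs retrieved by such cross-attention are then used by a blending layer to conditionally rectify the monocular cues of the master view; that rectification step is not shown here.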
Related papers
- CeViT: Copula-Enhanced Vision Transformer in multi-task learning and bi-group image covariates with an application to myopia screening [9.928208927136874]
We present a Vision Transformer-based bi-channel architecture, named CeViT, where the common features of a pair of eyes are extracted via a shared Transformer encoder.
We demonstrate that CeViT improves the baseline model in both high-myopia classification accuracy and axial length (AL) prediction for both eyes.
arXiv Detail & Related papers (2025-01-11T13:23:56Z)
- Exploring Invariant Representation for Visible-Infrared Person Re-Identification [77.06940947765406]
Cross-spectral person re-identification, which aims to associate identities to pedestrians across different spectra, faces the main challenge of modality discrepancy.
In this paper, we address the problem at both the image level and the feature level in an end-to-end hybrid learning framework named robust feature mining network (RFM).
Experimental results on two standard cross-spectral person re-identification datasets, RegDB and SYSU-MM01, demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2023-02-02T05:24:50Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
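As a rough illustration of the cross-view completion objective this summary describes, the hedged PyTorch sketch below masks most patches of one view, encodes the visible ones, and reconstructs the masked patches by cross-attending to the fully visible second view. All module choices and names (CrossViewCompletion, the use of nn.TransformerEncoder/nn.TransformerDecoder, the 0.9 mask ratio) are simplifying assumptions, not CroCo's actual code.

```python
import torch
import torch.nn as nn

class CrossViewCompletion(nn.Module):
    """Masked image modeling with a second view: decode masked view-1
    tokens against view-2 tokens via cross-attention."""

    def __init__(self, dim=256, num_patches=196, patch_dim=768, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        # Decoder layers cross-attend from view-1 tokens to view-2 tokens.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, patches1, patches2):
        B, N, D = patches1.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=patches1.device).argsort(dim=1)
        visible, masked = idx[:, :keep], idx[:, keep:]

        x1 = self.embed(patches1) + self.pos
        vis = torch.gather(x1, 1, visible.unsqueeze(-1).expand(-1, -1, x1.size(-1)))
        enc1 = self.encoder(vis)                              # visible view-1 tokens
        enc2 = self.encoder(self.embed(patches2) + self.pos)  # full second view

        # Reinsert mask tokens at the masked positions, then decode against view 2.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, visible.unsqueeze(-1).expand(-1, -1, enc1.size(-1)), enc1)
        pred = self.head(self.decoder(full + self.pos, enc2))

        # Regression loss on the masked patches only.
        gather_masked = lambda t: torch.gather(
            t, 1, masked.unsqueeze(-1).expand(-1, -1, t.size(-1)))
        return nn.functional.mse_loss(gather_masked(pred), gather_masked(patches1))
```

A pre-trained encoder of this kind can then be fine-tuned for dense tasks such as stereo matching or optical flow, which is the route the CroCo v2 summary refers to.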
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto-encoding transformation (MAET) model to enhance object detection in dark environments.
In a self-supervised manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation.
We achieve state-of-the-art performance on synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
- Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers.
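A hedged sketch of the depth-discretized epipolar sampling step is given below: for each target-frame pixel and each discrete depth hypothesis, the point is reprojected into the source frame, the source feature is bilinearly sampled there, and a dot-product matching score fills the cost volume. The function name, tensor layouts, and the single shared intrinsics matrix K are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def epipolar_cost_volume(feat_t, feat_s, K, K_inv, T_t_to_s, depth_bins):
    """Build a (B, D, H, W) cost volume by sampling source features along
    the epipolar line at D discrete depth hypotheses."""
    B, C, H, W = feat_t.shape
    device = feat_t.device

    # Target pixel grid in homogeneous coordinates: (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs.flatten(), ys.flatten(),
                       torch.ones(H * W, device=device)])
    rays = K_inv @ pix  # back-projected viewing rays

    costs = []
    for d in depth_bins:  # discrete depth hypotheses
        pts = rays * d    # 3D points at depth d in the target camera frame
        pts_h = torch.cat([pts, torch.ones(1, H * W, device=device)], dim=0)
        proj = K @ (T_t_to_s @ pts_h)[:3]  # reproject into the source frame
        uv = proj[:2] / proj[2].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                            uv[1] / (H - 1) * 2 - 1], dim=-1).view(1, H, W, 2)
        sampled = F.grid_sample(feat_s, grid.expand(B, -1, -1, -1),
                                align_corners=True)       # (B, C, H, W)
        costs.append((feat_t * sampled).sum(dim=1))       # dot-product score
    return torch.stack(costs, dim=1)                      # (B, D, H, W)
```

The resulting volume can then be refined by the self- and cross-attention layers the summary mentions before regressing depth.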
arXiv Detail & Related papers (2022-04-15T19:04:57Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
- Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation [51.714092199995044]
In many fields, self-supervised learning solutions are rapidly evolving and closing the gap with supervised approaches.
We propose a novel self-supervised paradigm that reverses the usual link between monocular and stereo depth estimation.
In order to train deep stereo networks, we distill knowledge through a monocular completion network.
arXiv Detail & Related papers (2020-08-17T07:40:22Z)