CompletionFormer: Depth Completion with Convolutions and Vision Transformers
- URL: http://arxiv.org/abs/2304.13030v1
- Date: Tue, 25 Apr 2023 17:59:47 GMT
- Title: CompletionFormer: Depth Completion with Convolutions and Vision Transformers
- Authors: Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, Stefano Mattoccia
- Abstract summary: This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 the FLOPs) than pure Transformer-based methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Given sparse depths and the corresponding RGB images, depth completion
aims at spatially propagating the sparse measurements throughout the whole
image to get a dense depth prediction. Despite the tremendous progress of
deep-learning-based depth completion methods, the locality of convolutional
layers and graph models makes it hard for the network to model long-range
relationships between pixels. While recent fully Transformer-based
architectures have reported encouraging results thanks to their global
receptive field, performance and efficiency gaps to the well-developed CNN
models remain because Transformers deteriorate local feature details. This
paper proposes a Joint Convolutional Attention and Transformer block (JCAT),
which deeply couples the convolutional attention layer and Vision Transformer
into one block, as the basic unit for constructing our depth completion model
in a pyramidal structure. This hybrid architecture naturally benefits from
both the local connectivity of convolutions and the global context of the
Transformer in one single model. As a result, our CompletionFormer outperforms
state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion
benchmark and the indoor NYUv2 dataset, achieving significantly higher
efficiency (nearly 1/3 the FLOPs) compared to pure Transformer-based methods.
Code is available at https://github.com/youmi-zym/CompletionFormer.
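
To make the JCAT idea concrete, the following is a minimal PyTorch sketch of a hybrid block that runs a convolutional-attention path and a self-attention path side by side and fuses them. The layer choices, the gating, and the residual fusion are assumptions inferred from the abstract, not the authors' released implementation; consult the linked repository for the real architecture.

```python
# A minimal, illustrative sketch of a JCAT-style hybrid block in PyTorch.
# Everything below (layer choices, gating, residual fusion) is an assumption
# inferred from the abstract, not the authors' released code.
import torch
import torch.nn as nn


class JCATBlockSketch(nn.Module):
    """Couples a convolutional-attention path (local detail) with a
    Vision-Transformer path (global context) in a single block."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Convolutional attention path: a depthwise conv plus channel
        # gating stands in for the paper's convolutional attention layer.
        self.conv_attn = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # local, per-channel
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),
        )
        # Transformer path: multi-head self-attention over all pixels
        # provides the global receptive field.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = x * self.conv_attn(x)               # local connectivity
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        t = self.norm1(tokens)
        g, _ = self.attn(t, t, t)                   # global context
        g = tokens + g
        g = g + self.mlp(self.norm2(g))
        global_path = g.transpose(1, 2).reshape(b, c, h, w)
        return local + global_path                  # fuse both views


# Depth completion input: an RGB image plus a sparse depth map that is zero
# wherever the sensor gave no measurement. A real model would stack such
# blocks at several resolutions to form the pyramidal structure.
rgb = torch.randn(1, 3, 64, 64)
sparse_depth = torch.zeros(1, 1, 64, 64)
sparse_depth[0, 0, ::16, ::16] = torch.rand(4, 4) * 10.0  # a few valid depths
stem = nn.Conv2d(4, 32, 3, padding=1)                     # hypothetical stem
features = stem(torch.cat([rgb, sparse_depth], dim=1))
dense_features = JCATBlockSketch(dim=32)(features)
print(dense_features.shape)  # torch.Size([1, 32, 64, 64])
```

The convolutional path keeps fine local detail cheap to compute, while the attention path supplies the long-range pixel relationships the abstract highlights; stacking such blocks at multiple resolutions would give the pyramidal structure described above.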
Related papers
- SDformer: Efficient End-to-End Transformer for Depth Completion [5.864200786548098]
Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor.
Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks.
To overcome the drawbacks of CNNs, a more effective and powerful method is presented: a sequence-to-sequence model built around adaptive self-attention.
arXiv Detail & Related papers (2024-09-12T15:52:08Z)
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well-trained on massive images.
Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art performance in unsupervised domain adaptation (UDA) for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion [3.8558637038709622]
We propose a new model for depth completion based on an encoder-decoder structure.
Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module.
Our model achieves remarkable performance in completing depth maps and outperforms existing approaches in accuracy and reliability.
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on a Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit a hierarchical Swin Transformer with shifted windows to extend DeepLab v3+ and to model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs on par with or better than most contemporary works that combine Vision Transformers and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than One [32.01675089157679]
We propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-16T09:09:05Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, the vision transformer, into salient object detection.
With a global view available even in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.