CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer
- URL: http://arxiv.org/abs/2312.08594v2
- Date: Fri, 2 Feb 2024 01:41:21 GMT
- Title: CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer
- Authors: Sicheng Wang, Hao Jiang, Lei Xiang
- Abstract summary: Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
- Score: 8.962657021133925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent deep multi-view stereo (MVS) methods have widely incorporated
transformers into cascade networks for high-resolution depth estimation,
achieving impressive results. However, existing transformer-based methods are
constrained by their computational costs, preventing their extension to finer
stages. In this paper, we propose a novel cross-scale transformer (CT) that
processes feature representations at different stages without additional
computation. Specifically, we introduce an adaptive matching-aware transformer
(AMT) that employs different interactive attention combinations at multiple
scales. This combined strategy enables our network to capture intra-image
context information and enhance inter-image feature relationships. In addition, we
present a dual-feature guided aggregation (DFGA) that embeds the coarse global
semantic information into the finer cost volume construction to further
strengthen global and local feature awareness. Meanwhile, we design a feature
metric loss (FM Loss) that evaluates the feature bias before and after
transformation to reduce the impact of feature mismatch on depth estimation.
Extensive experiments on the DTU dataset and the Tanks and Temples (T&T) benchmark
demonstrate that our method achieves state-of-the-art results. Code is
available at https://github.com/wscstrive/CT-MVSNet.
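For a concrete sense of the two ideas the abstract spells out, the sketch below gives a minimal PyTorch-style rendering of (a) one interactive attention combination in the spirit of the AMT, intra-image self-attention followed by inter-image cross-attention between a reference and a source view, and (b) a feature-metric style loss on the feature bias before and after the transformer. Every name, shape, the single-head attention, and the L1 form of the loss are illustrative assumptions, not the authors' implementation; the actual AMT, DFGA, and FM Loss are defined in the repository linked above.

```python
# Minimal sketch (not the authors' code) of an intra-/inter-attention block
# and a feature-metric style consistency loss, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention(q, k, v):
    """Single-head scaled dot-product attention over flattened positions."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v


class IntraInterAttention(nn.Module):
    """One interactive attention combination: self-attention on the reference
    features (intra-image context), then cross-attention from the reference
    to a source view (inter-image feature relationships)."""

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)  # intra branch
        self.to_q = nn.Linear(dim, dim, bias=False)        # inter branch, query
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)   # inter branch, key/value
        self.proj = nn.Linear(dim, dim)

    def forward(self, ref, src):
        # ref, src: (B, N, C) flattened feature maps of reference / source view
        q, k, v = self.to_qkv(ref).chunk(3, dim=-1)
        ref = ref + attention(q, k, v)                      # intra-image context
        k, v = self.to_kv(src).chunk(2, dim=-1)
        ref = ref + attention(self.to_q(ref), k, v)         # inter-image matching
        return self.proj(ref)


def feature_metric_loss(feat_before, feat_after):
    """One reading of an FM-style loss: penalise the bias between (normalised)
    features before and after the transformer so matching cues are preserved."""
    return F.l1_loss(F.normalize(feat_after, dim=-1),
                     F.normalize(feat_before, dim=-1))


if __name__ == "__main__":
    B, N, C = 2, 32 * 32, 64                # e.g. a 32x32 feature map per view
    ref, src = torch.randn(B, N, C), torch.randn(B, N, C)
    block = IntraInterAttention(C)
    out = block(ref, src)
    print(out.shape, feature_metric_loss(ref, out).item())
```

In a cascade setting such a block would presumably be instantiated at each scale with a different mix of intra- and inter-attention, and the coarse-stage semantics would additionally be injected into the finer cost volume (the DFGA step); those details are specific to the paper and are not sketched here.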
Related papers
- Adaptive Step-size Perception Unfolding Network with Non-local Hybrid Attention for Hyperspectral Image Reconstruction [0.39134031118910273]
We propose an adaptive step-size perception unfolding network (ASPUN), a deep unfolding network based on the FISTA algorithm.
In addition, we design a Non-local Hybrid Attention Transformer (NHAT) module to fully leverage the receptive-field advantage of the transformer.
Experimental results show that our ASPUN is superior to the existing SOTA algorithms and achieves the best performance.
arXiv Detail & Related papers (2024-07-04T16:09:52Z)
- FuseFormer: A Transformer for Visual and Thermal Image Fusion [3.6064695344878093]
We propose a novel methodology for the image fusion problem that mitigates the limitations associated with using classical evaluation metrics as loss functions.
Our approach integrates a transformer-based multi-scale fusion strategy that adeptly addresses local and global context information.
Our proposed method, along with the novel loss function definition, demonstrates superior performance compared to other competitive fusion algorithms.
arXiv Detail & Related papers (2024-02-01T19:40:39Z)
- Deep Laparoscopic Stereo Matching with Transformers [46.18206008056612]
The self-attention mechanism, successfully employed within the transformer structure, has shown promise in many computer vision tasks.
We propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design.
arXiv Detail & Related papers (2022-07-25T12:54:32Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, the need for Transformers to incorporate contextual information when extracting features dynamically is neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers [6.205844084751411]
We present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS).
We propose a powerful Feature Matching Transformer (FMT) to leverage intra- (self-) and inter- (cross-) attention to aggregate long-range context information.
Our method achieves state-of-the-art performance on DTU dataset, Tanks and Temples benchmark, and BlendedMVS dataset.
arXiv Detail & Related papers (2021-11-29T15:31:49Z)
- SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [33.29925021875922]
We propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI), and an Attention Refinement Function (ARF).
ISP explores the semantic diversity in different receptive spaces. CDI builds global attention and interaction among different levels in a decoupled space, which also avoids heavy computation.
Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks.
arXiv Detail & Related papers (2021-09-18T16:29:14Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.