GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular
Multi-Frame Depth Estimation
- URL: http://arxiv.org/abs/2309.17059v2
- Date: Tue, 5 Dec 2023 03:22:24 GMT
- Title: GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular
Multi-Frame Depth Estimation
- Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Zheyuan Zhou, Kerui
Hu
- Abstract summary: We propose an efficient component for cue fusion in monocular multi-frame depth estimation.
We represent scene attributes in the form of super tokens without relying on precise shapes.
Our method achieves state-of-the-art performance on the KITTI dataset with efficient fusion speed.
- Score: 7.158264965010546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth estimation provides an alternative approach for perceiving 3D
information in autonomous driving. Monocular depth estimation, whether with
single-frame or multi-frame inputs, has achieved significant success by
learning various types of cues and specializing in either static or dynamic
scenes. Recently, the fusion of these cues has become an attractive topic, aiming to
enable the combined cues to perform well in both types of scenes. However,
adaptive cue fusion relies on attention mechanisms, where the quadratic
complexity limits the granularity of cue representation. Additionally, explicit
cue fusion depends on precise segmentation, which imposes a heavy burden on
mask prediction. To address these issues, we propose the GSDC Transformer, an
efficient and effective component for cue fusion in monocular multi-frame depth
estimation. We utilize deformable attention to learn cue relationships at a
fine scale, while sparse attention reduces computational requirements when
granularity increases. To compensate for the precision drop in dynamic scenes,
we represent scene attributes in the form of super tokens without relying on
precise shapes. Within each super token attributed to dynamic scenes, we gather
its relevant cues and learn local dense relationships to enhance cue fusion.
Our method achieves state-of-the-art performance on the KITTI dataset with
efficient fusion speed.
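To make the fusion idea in the abstract concrete, below is a minimal PyTorch sketch of super-token-based cue fusion. All module names, tensor shapes, and layer choices (e.g. SuperTokenCueFusion, super_grid) are illustrative assumptions, and the paper's deformable/sparse attention is replaced by standard multi-head attention for brevity; this is a sketch of the general idea, not the authors' implementation.

```python
# Minimal PyTorch sketch of super-token cue fusion as described in the abstract.
# All names, shapes, and layer choices are assumptions for illustration; the
# paper's deformable / sparse attention is replaced by standard multi-head
# attention for brevity, so this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperTokenCueFusion(nn.Module):
    """Coarse super tokens summarize scene regions (no precise masks), gather the
    relevant monocular and multi-view cues, and scatter the fused result back to
    the fine-grained feature map."""

    def __init__(self, dim: int = 64, super_grid: int = 8, heads: int = 4):
        super().__init__()
        self.super_grid = super_grid
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mono_cues: torch.Tensor, multi_cues: torch.Tensor) -> torch.Tensor:
        # mono_cues / multi_cues: (B, C, H, W) feature maps encoding the two cue types.
        B, C, H, W = mono_cues.shape

        # Fine-scale cue tokens from both sources: (B, 2*H*W, C).
        fine = torch.cat([
            mono_cues.flatten(2).transpose(1, 2),
            multi_cues.flatten(2).transpose(1, 2),
        ], dim=1)

        # Super tokens: pool the combined cue map to a small grid so each token
        # stands for a scene region without a precise segmentation mask.
        sup = F.adaptive_avg_pool2d(mono_cues + multi_cues, self.super_grid)
        sup = sup.flatten(2).transpose(1, 2)          # (B, S*S, C)

        # Gather: super tokens attend over all fine cue tokens (a dense stand-in
        # for the paper's sparse / deformable gathering step).
        sup, _ = self.gather(sup, fine, fine)

        # Scatter: fine monocular tokens query the fused super tokens to pick up
        # the combined monocular + multi-view information.
        fused, _ = self.scatter(fine[:, : H * W], sup, sup)
        fused = self.norm(fused + fine[:, : H * W])
        return fused.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    fuse = SuperTokenCueFusion(dim=64, super_grid=8)
    mono = torch.randn(2, 64, 48, 160)    # monocular depth-cue features
    multi = torch.randn(2, 64, 48, 160)   # multi-view (cost-volume) cue features
    print(fuse(mono, multi).shape)        # torch.Size([2, 64, 48, 160])
```

Here adaptive average pooling plays the role of the super tokens that summarize scene regions without precise shapes; replacing the dense attention calls with sparse or deformable variants would recover the efficiency behaviour the abstract describes.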
Related papers
- CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory [3.119836924407993]
We propose CDXFormer, with a core component that is a powerful XLSTM-based spatial enhancement layer.
We introduce a scale-specific Feature Enhancer layer, incorporating a Cross-Temporal Global Perceptron customized for semantic-accurate deep features.
We also propose a Cross-Scale Interactive Fusion module to progressively interact global change representations with responses.
arXiv Detail & Related papers (2024-11-12T15:22:14Z)
- Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer [12.486504395099022]
Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data.
Lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately.
We introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions.
arXiv Detail & Related papers (2024-06-13T08:51:57Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- GraFT: Gradual Fusion Transformer for Multimodal Re-Identification [0.8999666725996975]
We introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID.
GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features.
We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks.
arXiv Detail & Related papers (2023-10-25T00:15:40Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking [51.16677396148247]
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames.
In this paper, we demonstrate this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues.
Our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack.
arXiv Detail & Related papers (2023-08-01T18:53:24Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)