A Cross-Scale Hierarchical Transformer with Correspondence-Augmented
Attention for inferring Bird's-Eye-View Semantic Segmentation
- URL: http://arxiv.org/abs/2304.03650v2
- Date: Thu, 17 Aug 2023 08:33:34 GMT
- Title: A Cross-Scale Hierarchical Transformer with Correspondence-Augmented
Attention for inferring Bird's-Eye-View Semantic Segmentation
- Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Kerui Hu, Kang Wang
- Abstract summary: Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community because it relies on cheap devices and supports real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for inferring BEV semantic segmentation.
Our method achieves state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
- Score: 13.013635162859108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As bird's-eye-view (BEV) semantic segmentation is simple to visualize and
easy to handle, it has been applied in autonomous driving to provide surrounding
information to downstream tasks. Inferring BEV semantic segmentation conditioned on
multi-camera-view images is a popular scheme in the community because it relies on
cheap devices and supports real-time processing. Recent work implemented this task by
learning the content and position relationship via the vision Transformer (ViT).
However, the quadratic complexity of ViT confines relationship learning to the latent
layer, leaving a scale gap that impedes the representation of fine-grained objects.
Moreover, the plain fusion of multi-view features does not match the way information
should be absorbed when representing BEV features. To tackle these issues, we propose
a novel cross-scale hierarchical Transformer with correspondence-augmented attention
for inferring BEV semantic segmentation. Specifically, we devise a hierarchical
framework to refine the BEV feature representation, where the feature map at the last
scale is only half the size of the final segmentation. To offset the extra computation
introduced by this hierarchical framework, we exploit a cross-scale Transformer to
learn feature relationships in a reversed-aligning way, and leverage residual
connections of BEV features to facilitate information transmission between scales. We
further propose correspondence-augmented attention to distinguish conducive from
inconducive correspondences. It is implemented in a simple yet effective way:
attention scores are amplified before the Softmax operation, so that
position-view-related scores are highlighted while position-view-unrelated scores are
suppressed. Extensive experiments demonstrate that our method achieves
state-of-the-art performance in inferring BEV semantic segmentation conditioned on
multi-camera-view images.
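The correspondence-augmented attention described in the abstract boils down to re-weighting attention scores before the Softmax according to whether a BEV position actually corresponds to a given camera-view location. The snippet below is a minimal sketch of that idea in a PyTorch setting; the tensor shapes, the binary correspondence mask, and the gain factor are illustrative assumptions rather than the authors' exact implementation.

```python
import math

import torch
import torch.nn.functional as F


def correspondence_augmented_attention(q, k, v, corr_mask, gain=2.0):
    """Scaled dot-product attention with correspondence-based score amplification.

    q:         (B, Nq, D) BEV query features
    k, v:      (B, Nk, D) flattened multi-camera-view key/value features
    corr_mask: (B, Nq, Nk) binary mask, 1 where a BEV position projects into the
               corresponding camera-view location (position-view-related), else 0
               (hypothetical input for this sketch)
    gain:      amplification factor applied before the Softmax (assumed value)
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (B, Nq, Nk)

    # Adding log(gain) to the related scores multiplies their unnormalised
    # attention weights by `gain`, so the Softmax boosts position-view-related
    # entries and correspondingly suppresses unrelated ones.
    scores = scores + corr_mask * math.log(gain)

    attn = F.softmax(scores, dim=-1)   # (B, Nq, Nk)
    return torch.matmul(attn, v)       # (B, Nq, D)


if __name__ == "__main__":
    B, Nq, Nk, D = 2, 16, 64, 32
    q = torch.randn(B, Nq, D)
    k = torch.randn(B, Nk, D)
    v = torch.randn(B, Nk, D)
    corr_mask = (torch.rand(B, Nq, Nk) > 0.5).float()
    out = correspondence_augmented_attention(q, k, v, corr_mask)
    print(out.shape)  # torch.Size([2, 16, 32])
```

Amplifying the scores additively in log-space is only one reasonable reading of "amplifying attention scores before the Softmax": it keeps the operation monotone in the original scores while redistributing probability mass toward valid correspondences, which matches the highlight-and-suppress behaviour the abstract describes.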
Related papers
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z) - Semi-Supervised Learning for Visual Bird's Eye View Semantic
Segmentation [16.3996408206659]
We present a novel semi-supervised framework for visual BEV semantic segmentation that boosts performance by exploiting unlabeled images during training.
A consistency loss that makes full use of unlabeled data is then proposed to constrain the model not only on the semantic prediction but also on the BEV feature.
Experiments on the nuScenes and Argoverse datasets show that our framework can effectively improve prediction accuracy.
arXiv Detail & Related papers (2023-08-28T12:23:36Z) - X-Align++: cross-modal cross-view alignment for Bird's-eye-view
segmentation [44.58686493878629]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on the nuScenes and KITTI-360 datasets.
arXiv Detail & Related papers (2023-06-06T15:52:55Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision
Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the vision Transformer suitable for consistency representation learning.
arXiv Detail & Related papers (2022-10-23T15:24:47Z) - X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View
Segmentation [44.95630790801856]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
arXiv Detail & Related papers (2022-10-13T06:42:46Z) - ViT-BEVSeg: A Hierarchical Transformer Network for Monocular
Birds-Eye-View Segmentation [2.70519393940262]
We evaluate the use of vision transformers (ViT) as a backbone architecture to generate Bird's-Eye-View (BEV) maps.
Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.
We evaluate our approach on the nuScenes dataset demonstrating a considerable improvement relative to state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-31T10:18:36Z) - GitNet: Geometric Prior-based Transformation for Birds-Eye-View
Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary
Camera Rigs [3.5728676902207988]
We present an effective transformer-based method for BEV semantic segmentation from arbitrary camera rigs.
Specifically, our method first encodes image features from arbitrary cameras with a shared backbone.
An efficient multi-camera deformable attention unit is designed to carry out the BEV-to-image view transformation.
arXiv Detail & Related papers (2022-03-08T12:39:51Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)