P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic
Segmentation
- URL: http://arxiv.org/abs/2310.15025v1
- Date: Mon, 23 Oct 2023 15:23:31 GMT
- Title: P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic
Segmentation
- Authors: Mohammed A. M. Elhassan, Changjun Zhou, Amina Benabid, Abuzar B. M.
Adam
- Abstract summary: We propose a real-time semantic segmentation architecture named Pyramid Pooling Axial Transformer (P2AT)
The proposed P2AT takes a coarse feature from the CNN encoder to produce scale-aware contextual features.
We evaluate P2AT variants on three challenging scene-understanding datasets.
- Score: 1.1470070927586018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Transformer-based models have achieved promising results in various
vision tasks, due to their ability to model long-range dependencies. However,
transformers are computationally expensive, which limits their applications in
real-time tasks such as autonomous driving. In addition, efficient selection
and fusion of local and global features are vital for accurate dense
prediction, especially in driving-scene understanding tasks. In this paper, we propose a
real-time semantic segmentation architecture named Pyramid Pooling Axial
Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN
encoder and produces scale-aware contextual features, which are then combined
via a multi-level feature aggregation scheme to produce enhanced contextual
features. Specifically, we introduce a pyramid pooling axial transformer to
capture intricate spatial and channel dependencies, leading to improved
performance on semantic segmentation. Then, we design a Bidirectional Fusion
module (BiF) to combine semantic information at different levels. Meanwhile, a
Global Context Enhancer is introduced to compensate for the inadequacy of
concatenating different semantic levels. Finally, a decoder block is proposed
to help maintain a larger receptive field. We evaluate P2AT variants on three
challenging scene-understanding datasets. In particular, our P2AT variants
achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and
81.1% for P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our
experiments on Cityscapes and Pascal VOC 2012 demonstrate the efficiency of
the proposed architecture, with P2AT-M achieving 78.7% on Cityscapes.
The source code will be available at
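The abstract describes the core mechanism only at a high level. As a rough illustration, here is a minimal PyTorch sketch of pyramid pooling combined with axial attention, assuming each pooled pyramid branch runs height- and width-axis attention before the branches are fused; all module names, pool sizes, and dimensions are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a pyramid-pooling axial attention block.
# Names, pool sizes, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention(nn.Module):
    """Multi-head self-attention applied along one spatial axis only."""
    def __init__(self, dim, heads=4, axis='h'):
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        if self.axis == 'h':                   # attend along the height axis
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:                                  # attend along the width axis
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        if self.axis == 'h':
            out = out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        else:
            out = out.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out

class PyramidPoolingAxialBlock(nn.Module):
    """Pools the coarse CNN feature at several scales, runs axial attention
    on each pooled map, and fuses the upsampled results -- a guess at the
    'scale-aware contextual features' described in the abstract."""
    def __init__(self, dim, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Sequential(AxialAttention(dim, axis='h'),
                          AxialAttention(dim, axis='w'))
            for _ in pool_sizes)
        self.proj = nn.Conv2d(dim * (len(pool_sizes) + 1), dim, 1)

    def forward(self, x):                      # x: coarse encoder feature
        h, w = x.shape[-2:]
        feats = [x]
        for size, branch in zip(self.pool_sizes, self.branches):
            p = F.adaptive_avg_pool2d(x, size)          # pyramid pooling
            p = branch(p)                               # axial attention
            feats.append(F.interpolate(p, (h, w), mode='bilinear',
                                       align_corners=False))
        return self.proj(torch.cat(feats, dim=1))

x = torch.randn(1, 64, 32, 64)                 # e.g. a stride-16 encoder output
print(PyramidPoolingAxialBlock(64)(x).shape)   # torch.Size([1, 64, 32, 64])
```

Axial attention factorizes 2D self-attention into two 1D passes, reducing the cost from O((HW)^2) to roughly O(HW(H+W)), and pyramid pooling further shrinks the maps the attention runs on; together these are what make such a block plausible for real-time use.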
Related papers
- HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers.
HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns globally shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - S$^2$-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation [6.744210626403423]
This paper presents a new model that achieves a trade-off between accuracy and speed for real-time road-scene semantic segmentation.
Specifically, we propose a lightweight model named Scale-aware Strip Attention Guided Feature Pyramid Network (S$^2$-FPN).
Our network consists of three main modules: an Attention Pyramid Fusion (APF) module, a Scale-aware Strip Attention Module (SSAM), and a Global Feature Upsample (GFU) module.
arXiv Detail & Related papers (2022-06-15T05:02:49Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - P2T: Pyramid Pooling Transformer for Scene Understanding [62.41912463252468]
Plugged with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T); a hedged sketch of this pooling-based attention idea appears after this list.
arXiv Detail & Related papers (2021-06-22T18:28:52Z) - TransVOS: Video Object Segmentation with Transformers [13.311777431243296]
We propose a vision transformer to fully exploit and model both the temporal and spatial relationships.
To slim the popular two-encoder pipeline, we design a single two-path feature extractor.
Experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-06-01T15:56:10Z)
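As flagged in the P2T entry above, the pooling-based MHSA idea can be sketched briefly: keys and values are taken from multi-scale pooled versions of the feature map, so attention cost scales with the much smaller pooled token count rather than the full resolution. The class name, pool ratios, and shapes below are assumptions for illustration, not the P2T code.

```python
# Hedged sketch of pooling-based multi-head self-attention in the spirit
# of P2T; pool ratios and names are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingMHSA(nn.Module):
    """Self-attention whose keys/values come from multi-scale pooled tokens."""
    def __init__(self, dim, heads=4, pool_ratios=(2, 4, 8)):
        super().__init__()
        self.pool_ratios = pool_ratios
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, hw):                   # x: (B, N, C), hw = (H, W)
        b, n, c = x.shape
        h, w = hw
        fmap = x.transpose(1, 2).reshape(b, c, h, w)
        # Pool the feature map at several ratios, flatten back to tokens.
        pooled = [F.adaptive_avg_pool2d(fmap, (max(h // r, 1), max(w // r, 1)))
                      .flatten(2).transpose(1, 2)
                  for r in self.pool_ratios]
        kv = torch.cat(pooled, dim=1)           # far fewer than N tokens
        out, _ = self.attn(x, kv, kv)           # queries keep full resolution
        return out

tokens = torch.randn(2, 32 * 32, 64)            # a 32x32 feature map as tokens
print(PoolingMHSA(64)(tokens, (32, 32)).shape)  # torch.Size([2, 1024, 64])
```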
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.