Efficient Context Integration through Factorized Pyramidal Learning for
Ultra-Lightweight Semantic Segmentation
- URL: http://arxiv.org/abs/2302.11785v1
- Date: Thu, 23 Feb 2023 05:34:51 GMT
- Title: Efficient Context Integration through Factorized Pyramidal Learning for
Ultra-Lightweight Semantic Segmentation
- Authors: Nadeem Atif, Saquib Mazhar, Debajit Sarma, M. K. Bhuyan and Shaik Rafi
Ahamed
- Abstract summary: We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
- Score: 1.0499611180329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation is a pixel-level prediction task to classify each pixel
of the input image. Deep learning models, such as convolutional neural networks
(CNNs), have been extremely successful in achieving excellent performances in
this domain. However, mobile applications, such as autonomous driving, demand
real-time processing of incoming streams of images. Hence, achieving efficient
architectures along with enhanced accuracy is of paramount importance. Since
accuracy and model size of CNNs are intrinsically in tension, the
challenge is to achieve a decent trade-off between the two. To
address this, we propose a novel Factorized Pyramidal Learning (FPL) module to
aggregate rich contextual information in an efficient manner. On one hand, it
uses a bank of convolutional filters with multiple dilation rates which leads
to multi-scale context aggregation; crucial in achieving better accuracy. On
the other hand, parameters are reduced by a careful factorization of the
employed filters; crucial in achieving lightweight models. Moreover, we
decompose the spatial pyramid into two stages which enables a simple and
efficient feature fusion within the module to solve the notorious checkerboard
effect. We also design a dedicated Feature-Image Reinforcement (FIR) unit to
carry out the fusion operation of shallow and deep features with the
downsampled versions of the input image. This gives an accuracy enhancement
without increasing model parameters. Based on the FPL module and FIR unit, we
propose an ultra-lightweight real-time network, called FPLNet, which achieves
state-of-the-art accuracy-efficiency trade-off. More specifically, with fewer
than 0.5 million parameters, the proposed network achieves 66.93\% and
66.28\% mIoU on Cityscapes validation and test set, respectively. Moreover,
FPLNet has a processing speed of 95.5 frames per second (FPS).
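The parameter savings claimed for filter factorization can be illustrated with a quick count. This is a minimal sketch, not the exact FPL design: the channel widths and dilation rates below are assumed for illustration, and dilation is included only to note that it enlarges the receptive field without adding parameters.

```python
def conv_params(c_in, c_out, kh, kw):
    """Weight count of a conv layer with a kh x kw kernel (bias omitted)."""
    return c_in * c_out * kh * kw

c_in, c_out = 64, 64
rates = [1, 2, 4, 8]  # hypothetical dilation rates for the spatial pyramid

# Standard pyramid: one 3x3 dilated conv per rate. Dilation changes the
# receptive field but not the parameter count.
standard = sum(conv_params(c_in, c_out, 3, 3) for _ in rates)

# Factorized pyramid: each 3x3 kernel replaced by a 3x1 followed by a 1x3,
# cutting the per-branch weight count from 9*C*C to 6*C*C.
factorized = sum(conv_params(c_in, c_out, 3, 1) + conv_params(c_out, c_out, 1, 3)
                 for _ in rates)

print(standard, factorized, factorized / standard)
```

With these assumed sizes the factorized pyramid uses two thirds of the standard pyramid's weights; the ratio 6/9 holds for any channel width, and larger kernels factorize with even greater savings.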
Related papers
- LeRF: Learning Resampling Function for Adaptive and Efficient Image Interpolation [64.34935748707673]
Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors.
We propose a novel method of Learning Resampling (termed LeRF) which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption.
LeRF assigns spatially varying resampling functions to input image pixels and learns to predict the shapes of these resampling functions with a neural network.
arXiv Detail & Related papers (2024-07-13T16:09:45Z) - SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in
Spiking Transformer [12.717450255837178]
Spiking Neural Networks (SNNs) have the advantages of low power consumption and high energy efficiency.
The most advanced SNN, Spikformer, combines the self-attention module from Transformer with SNN to achieve remarkable performance.
We present SparseSpikformer, a co-design framework aimed at achieving sparsity in Spikformer through token and weight pruning techniques.
arXiv Detail & Related papers (2023-11-15T09:22:52Z) - Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z) - RingMo-lite: A Remote Sensing Multi-task Lightweight Network with
CNN-Transformer Hybrid Framework [15.273362355253779]
This paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework to optimize the interpretation process.
The proposed RingMo-lite reduces parameters by over 60% in various RS image interpretation tasks; average accuracy drops by less than 2% in most scenes, and it achieves SOTA performance compared to models of similar size.
arXiv Detail & Related papers (2023-09-16T14:15:59Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
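The clustered-token idea can be sketched in a few lines of NumPy. This is an illustrative approximation, not ClusTR's actual algorithm: a one-step nearest-center assignment stands in for whatever clustering the paper uses, and the token, channel, and cluster sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, K = 256, 32, 16  # tokens, channels, clusters (illustrative sizes)
q = rng.standard_normal((N, C))   # query tokens
kv = rng.standard_normal((N, C))  # key/value tokens to be clustered

# Crude content-based clustering: assign each key/value token to the
# nearest of K randomly chosen centers, then average tokens per cluster.
centers = kv[rng.choice(N, K, replace=False)]
assign = np.argmin(((kv[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
clustered = np.stack([kv[assign == k].mean(0) if (assign == k).any() else centers[k]
                      for k in range(K)])

# Attention over K clustered tokens instead of N originals:
# the score matrix shrinks from N x N to N x K.
scores = q @ clustered.T / np.sqrt(C)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ clustered
print(out.shape)  # (256, 32)
```

The output keeps one vector per query token, but each attention row is computed against only K aggregated tokens, which is the source of the claimed cost reduction.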
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed as ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Recent transformer-based image recognition models show consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale
Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP)
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.