Lightweight Real-time Semantic Segmentation Network with Efficient
Transformer and CNN
- URL: http://arxiv.org/abs/2302.10484v1
- Date: Tue, 21 Feb 2023 07:16:53 GMT
- Title: Lightweight Real-time Semantic Segmentation Network with Efficient
Transformer and CNN
- Authors: Guoan Xu, Juncheng Li, Guangwei Gao, Huimin Lu, Jian Yang, and Dong
Yue
- Abstract summary: We propose a lightweight real-time semantic segmentation network called LETNet.
LETNet combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies.
Experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency.
- Score: 34.020978009518245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the past decade, convolutional neural networks (CNNs) have shown prominent performance in semantic segmentation. Although CNN models are very impressive, their ability to capture global representations is still insufficient, which leads to suboptimal results. Transformers have recently achieved huge success in NLP tasks, demonstrating their advantage in modeling long-range dependencies, and have also attracted tremendous attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction; this, however, comes at the cost of deteriorated local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies. Meanwhile, the carefully designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both have a positive impact on training from scratch. Extensive experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs, yet yields 72.8% mIoU at 120 FPS on the Cityscapes test set and 70.5% mIoU at 250 FPS on the CamVid test set using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
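
The abstract names the Lightweight Dilated Bottleneck (LDB) module but does not describe its internals here. Purely as an illustration of the general pattern such modules follow (a residual bottleneck that uses a cheap dilated depthwise convolution to enlarge the receptive field), here is a minimal PyTorch sketch; the layer ordering, reduction factor, and dilation rate are assumptions, not the authors' LDB design.

```python
import torch
import torch.nn as nn

class DilatedBottleneckSketch(nn.Module):
    """Illustrative residual bottleneck with a dilated depthwise conv.

    NOT the paper's LDB module: it only shows the common lightweight
    pattern of reduce -> dilated depthwise conv -> expand -> residual add.
    """
    def __init__(self, channels: int, reduction: int = 2, dilation: int = 2):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # depthwise 3x3 with dilation: few parameters, wide receptive field
        self.dw = nn.Sequential(
            nn.Conv2d(mid, mid, kernel_size=3, padding=dilation,
                      dilation=dilation, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.expand(self.dw(self.reduce(x))))

# quick shape check: spatial size and channel count are preserved
y = DilatedBottleneckSketch(64)(torch.randn(1, 64, 128, 256))
print(y.shape)  # torch.Size([1, 64, 128, 256])
```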
Related papers
- CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features.
This integration enables efficient processing of detailed local and broader contextual information.
Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
- Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network [37.84039482457571]
We propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet.
It effectively combines CNNs and Transformers while reducing redundant computations and memory footprint.
With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid dataset.
arXiv Detail & Related papers (2024-10-03T05:45:24Z)
- HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers.
HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can benefit from increasing parameters and training data.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, as ViTs do (a minimal deformable-convolution sketch is given after this list).
arXiv Detail & Related papers (2022-11-10T18:59:04Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- Pixel Difference Networks for Efficient Edge Detection [71.03915957914532]
We propose a lightweight yet effective architecture named Pixel Difference Network (PiDiNet) for efficient edge detection.
Extensive experiments on BSDS500, NYUD, and Multicue datasets are provided to demonstrate its effectiveness.
A faster version of PiDiNet with less than 0.1M parameters still achieves performance comparable to the state of the art at 200 FPS.
arXiv Detail & Related papers (2021-08-16T10:42:59Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention, which focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens (see the simplified sketch after this list).
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient variant, Container-Light, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
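
As referenced in the InternImage entry above: InternImage itself is built on a newer deformable operator (DCNv3), which is not reproduced here. The sketch below only illustrates the basic deformable-convolution idea (kernel sampling locations shifted by learned, input-dependent offsets) using torchvision's DeformConv2d; the block structure and names are ours, not InternImage's.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlockSketch(nn.Module):
    """Minimal deformable-convolution block: a plain conv predicts a
    (dy, dx) offset for every kernel tap at every output position, and
    DeformConv2d samples the input at those shifted locations."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dconv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dconv(x, self.offset(x))

# quick shape check
x = torch.randn(1, 32, 64, 64)
print(DeformBlockSketch(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```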
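As referenced in the VOLO entry above, outlook attention predicts the attention weights over each KxK local window directly from that window's centre token with a linear layer, instead of computing query-key dot products. The PyTorch sketch below is a simplified single-head, stride-1 variant that omits the multi-head split, attention scaling, and pooled stride of the original design; class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleOutlookAttention(nn.Module):
    """Simplified, single-head, stride-1 outlook attention (VOLO-style).

    Attention weights over every KxK local window are generated directly
    from the window's centre feature by a linear layer, rather than from
    query-key dot products.
    """
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        self.v = nn.Linear(dim, dim, bias=False)
        # one KxK attention map for each of the KxK slots in the window
        self.attn = nn.Linear(dim, kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) channel-last feature map
        B, H, W, C = x.shape
        k2 = self.k * self.k

        v = self.v(x).permute(0, 3, 1, 2)                    # (B, C, H, W)
        v = F.unfold(v, self.k, padding=self.pad)            # (B, C*k*k, H*W)
        v = v.reshape(B, C, k2, H * W).permute(0, 3, 2, 1)   # (B, H*W, k*k, C)

        # per-position attention over its local window, from the centre token
        a = self.attn(x).reshape(B, H * W, k2, k2).softmax(dim=-1)

        out = a @ v                                           # (B, H*W, k*k, C)
        out = out.permute(0, 3, 2, 1).reshape(B, C * k2, H * W)
        out = F.fold(out, (H, W), self.k, padding=self.pad)   # (B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1))             # (B, H, W, C)

# quick shape check
tokens = torch.randn(2, 14, 14, 64)
print(SimpleOutlookAttention(64)(tokens).shape)  # torch.Size([2, 14, 14, 64])
```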
This list is automatically generated from the titles and abstracts of the papers on this site.