Related papers: Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network

Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network

URL: http://arxiv.org/abs/2410.02224v1
Date: Thu, 3 Oct 2024 05:45:24 GMT
Title: Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network
Authors: Yangyang Qiu, Guoan Xu, Guangwei Gao, Zhenhua Guo, Yi Yu, Chia-Wen Lin,
Abstract summary: We propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet. It effectively combines CNNs and Transformers while reducing redundant computations and memory footprint. With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid dataset.
Score: 37.84039482457571
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, the integration of the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has created a sensation in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprint. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. The incorporation of a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test dataset using a single RTX2080Ti GPU.

Related papers

ECMNet:Lightweight Semantic Segmentation with Efficient CNN-Mamba Network [0.0]
ECMNet combines CNN with Mamba skillfully in a capsule-based framework to address their complementary weaknesses.<n>The proposed model excels in accuracy and efficiency balance, achieving 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets.
arXiv Detail & Related papers (2025-06-10T09:44:23Z)
ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new detextbfCoupled dutextbfAl-interactive lineatextbfR atttextbfEntion (CARE) mechanism. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies. By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks. They exhibit higher computational complexity and inference latency than convolutional neural networks. We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet)
arXiv Detail & Related papers (2024-09-26T07:24:09Z)
HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical features extraction ability of CNNs with the global dependency modeling capability of Transformers. HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL) We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN [34.020978009518245]
We propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with Transformer effectively in a capsule embedding style to compensate for respective deficiencies. Experiments performed on challenging datasets demonstrate that LETNet achieves superior performances in accuracy and efficiency balance.
arXiv Detail & Related papers (2023-02-21T07:16:53Z)
Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF) We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results. In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
Container: Context Aggregation Network [83.12004501984043]
Recent finding shows that a simple based solution without any traditional convolutional or Transformer components can produce effective visual representations. We present the model (CONText Ion NERtwok), a general-purpose building block for multi-head context aggregation. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named modellight, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
MSCFNet: A Lightweight Network With Multi-Scale Context Fusion for Real-Time Semantic Segmentation [27.232578592161673]
We devise a novel lightweight network using a multi-scale context fusion scheme (MSCFNet) The proposed MSCFNet contains only 1.15M parameters, achieves 71.9% Mean IoU and can run at over 50 FPS on a single Titan XP GPU configuration.
arXiv Detail & Related papers (2021-03-24T08:28:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.