HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation
- URL: http://arxiv.org/abs/2407.07441v2
- Date: Thu, 11 Jul 2024 02:19:44 GMT
- Title: HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation
- Authors: Guoan Xu, Wenjing Jia, Tao Wu, Ligeng Chen, Guangwei Gao
- Abstract summary: We introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers.
HAFormer achieves high performance with minimal computational overhead and compact model size.
- Score: 11.334990474402915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for enhancement, particularly when considering constraints on computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation challenges. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic calculations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and compact model size, reaching 74.2% mIoU on Cityscapes and 71.1% mIoU on CamVid test datasets, with frame rates of 105 FPS and 118 FPS on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.
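The abstract reports accuracy as mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the toy label lists are hypothetical and not taken from the HAFormer code. Benchmark evaluation scripts typically accumulate counts over a whole test set and skip ignored labels, which this sketch omits.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels of equal length.
    (Hypothetical helper for illustration only.)
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts toward both intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: a false positive for class p and a false negative for class t.
            union[p] += 1
            union[t] += 1
    # Skip classes that never occur, so they do not distort the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example with three classes on six "pixels".
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is about 0.5.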
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN [34.020978009518245]
We propose a lightweight real-time semantic segmentation network called LETNet.
LETNet combines a U-shaped CNN with Transformer effectively in a capsule embedding style to compensate for respective deficiencies.
Experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency.
arXiv Detail & Related papers (2023-02-21T07:16:53Z)
- Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation [9.967643080731683]
We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono.
The full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
arXiv Detail & Related papers (2022-11-23T18:43:41Z)
- LAPFormer: A Light and Accurate Polyp Segmentation Transformer [6.352264764099531]
We propose a new model with an encoder-decoder architecture named LAPFormer, which uses a hierarchical Transformer encoder to better extract global features.
Our proposed decoder contains a progressive feature fusion module designed for fusing features from upper scales and lower scales.
We test our model on five popular benchmark datasets for polyp segmentation.
arXiv Detail & Related papers (2022-10-10T01:52:30Z)
- Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
- HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation [3.478921293603811]
HiFormer is a novel method that efficiently bridges a CNN and a transformer for medical image segmentation.
To secure a fine fusion of global and local features, we propose a Double-Level Fusion (DLF) module in the skip connection of the encoder-decoder structure.
arXiv Detail & Related papers (2022-07-18T11:30:06Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
Transformers that need to incorporate contextual information to extract features dynamically are neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object result.
The recent transformer-based image recognition model ViT also shows consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.