AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing
- URL: http://arxiv.org/abs/2103.05930v1
- Date: Wed, 10 Mar 2021 08:38:29 GMT
- Title: AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing
- Authors: Qi Song and Kangfu Mei and Rui Huang
- Abstract summary: We propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics.
AttaNet consists of two primary modules: Strip Attention Module (SAM) and Attention Fusion Module (AFM).
- Score: 12.409365458889082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two factors have proven to be very important to the performance of semantic
segmentation models: global context and multi-level semantics. However,
generating features that capture both factors always leads to high
computational complexity, which is problematic in real-time scenarios. In this
paper, we propose a new model, called Attention-Augmented Network (AttaNet), to
capture both global context and multilevel semantics while keeping the
efficiency high. AttaNet consists of two primary modules: Strip Attention
Module (SAM) and Attention Fusion Module (AFM). Observing that challenging
images with low segmentation accuracy contain a significantly larger number of
vertical strip areas than horizontal ones, SAM utilizes a striping operation to
drastically reduce the complexity of encoding global context in the vertical
direction while keeping most of the contextual information, compared to
non-local approaches. Moreover, AFM follows a cross-level aggregation strategy
to limit the computation, and adopts an attention strategy to weight the
importance of different levels of features at each pixel when fusing them,
obtaining an efficient multi-level representation. We have conducted extensive
experiments on two semantic segmentation benchmarks, and our network achieves
different levels of speed/accuracy trade-offs on Cityscapes, e.g., 71 FPS/79.9%
mIoU, 130 FPS/78.5% mIoU, and 180 FPS/70.1% mIoU, and leading performance on
ADE20K as well.
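The abstract gives enough detail for an illustrative sketch of SAM. In the minimal PyTorch module below, the class name and all details are our own, not the authors' released code; in particular we assume, for illustration, that keys and values are average-pooled along the vertical axis, so each pixel attends over W strip descriptors instead of all H*W positions:

```python
import torch
import torch.nn as nn

class StripAttentionSketch(nn.Module):
    """Illustrative strip attention (not the official AttaNet code):
    pooling keys/values along one spatial axis shrinks the attention
    map from (HW x HW) to (HW x W)."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # B x HW x C'
        # Striping: average over the vertical axis -> one descriptor per column.
        k = self.key(x).mean(dim=2)                     # B x C' x W
        v = self.value(x).mean(dim=2).transpose(1, 2)   # B x W x C
        attn = torch.softmax(q @ k, dim=-1)             # B x HW x W
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x                                  # residual connection
```

AFM's per-pixel weighting of feature levels can be sketched in the same spirit; the gate below is a generic attention-style fusion assumed for illustration, not the module from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusionSketch(nn.Module):
    """Illustrative cross-level fusion: a predicted per-pixel weight
    mixes a low-level map with an upsampled high-level map."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        a = self.gate(torch.cat([low, high], dim=1))    # B x 1 x H x W
        return a * low + (1 - a) * high
```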
Related papers
- MacFormer: Semantic Segmentation with Fine Object Boundaries [38.430631361558426]
We introduce a new semantic segmentation architecture, "MacFormer", which features two key components.
Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers.
Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain.
MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on the benchmark datasets ADE20K and Cityscapes.
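The blurb names the mechanism but not its shape. The sketch below shows one common way learnable agent tokens can mediate encoder-decoder attention; the class name, head count, and two-step gather/read pattern are our assumptions, not MacFormer's published design:

```python
import torch
import torch.nn as nn

class AgentCrossAttentionSketch(nn.Module):
    """Illustrative agent-token cross-attention: a few learnable agents
    first summarize encoder tokens, then decoder tokens read from the
    agents, replacing one dense (N x N) attention with two (N x A) ones."""
    def __init__(self, dim, num_agents=16, num_heads=4):
        super().__init__()
        self.agents = nn.Parameter(torch.randn(1, num_agents, dim))
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, enc_tokens, dec_tokens):
        agents = self.agents.expand(enc_tokens.size(0), -1, -1)
        # Agents gather context from the encoder sequence.
        agents, _ = self.collect(agents, enc_tokens, enc_tokens)
        # Decoder tokens read the gathered context back from the agents.
        out, _ = self.distribute(dec_tokens, agents, agents)
        return dec_tokens + out
```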
arXiv Detail & Related papers (2024-08-11T05:36:10Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower-level features.
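One plausible reading of high-level semantics guiding attention over lower-level features is a top-down spatial gate; the sketch below is an assumption-laden illustration, not TOPIQ's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttentionSketch(nn.Module):
    """Illustrative top-down attention: semantics from a deeper stage are
    upsampled and used to weight spatial positions of a shallower,
    higher-resolution feature map."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.proj = nn.Conv2d(high_ch, low_ch, 1)
        self.attn = nn.Conv2d(low_ch, 1, 1)

    def forward(self, low, high):
        high = F.interpolate(self.proj(high), size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        # Spatial attention map driven by the high-level semantics.
        a = torch.sigmoid(self.attn(high * low))        # B x 1 x H x W
        return a * low
```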
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - Multi-interactive Feature Learning and a Full-time Multi-modality
Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - ASAP: Accurate semantic segmentation for real time performance [3.5327983932835165]
We propose an efficient feature fusion method, Feature Fusion with Different Norms (FFDN).
FFDN utilizes the rich global context of multi-level scales and a vertical pooling module before self-attention.
We achieve a mean Intersection-over-Union (mIoU) of 73.1 and 191 Frames Per Second (FPS), comparable to state-of-the-art results on the Cityscapes test dataset.
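A "vertical pooling module before self-attention" suggests collapsing the height axis so attention runs over far fewer tokens. The sketch below follows that assumption; FFDN's exact design may differ:

```python
import torch
import torch.nn as nn

class VerticalPooledAttentionSketch(nn.Module):
    """Illustrative vertical pooling before self-attention: collapsing the
    vertical axis leaves W column descriptors, so attention costs
    O(W^2) instead of O((HW)^2)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                               # x: B x C x H x W
        cols = x.mean(dim=2).transpose(1, 2)            # B x W x C
        ctx, _ = self.attn(cols, cols, cols)            # attention over columns
        # Broadcast the column context back across the vertical axis.
        return x + ctx.transpose(1, 2).unsqueeze(2)
```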
arXiv Detail & Related papers (2022-10-04T02:35:53Z) - Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z) - S$^2$-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation [6.744210626403423]
This paper presents a new model to achieve a trade-off between accuracy/speed for real-time road scene semantic segmentation.
Specifically, we propose a lightweight model named Scale-aware Strip Attention Guided Feature Pyramid Network (S$^2$-FPN).
Our network consists of three main modules: Attention Pyramid Fusion (APF) module, Scale-aware Strip Attention Module (SSAM), and Global Feature Upsample (GFU) module.
arXiv Detail & Related papers (2022-06-15T05:02:49Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation poses some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate superior accuracy and speed compared to existing approaches.
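The summary calls fast spatial attention a simple modification of self-attention. A common way to obtain this kind of speedup, shown here as a hedged sketch rather than the paper's exact formulation, is to normalize queries and keys and reorder the matrix products so the cost grows linearly with the number of pixels:

```python
import torch
import torch.nn.functional as F

def fast_attention_sketch(q, k, v):
    """Linear-complexity attention sketch: L2-normalize queries and keys,
    then form the small (C' x C) product k^T v first, giving O(N*C*C')
    cost instead of the O(N^2*C) of softmax attention.
    Shapes: q, k are B x N x C'; v is B x N x C."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    context = k.transpose(1, 2) @ v                     # B x C' x C
    return (q @ context) / q.size(1)                    # B x N x C
```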
arXiv Detail & Related papers (2020-07-07T22:37:16Z) - BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time
Semantic Segmentation [118.46210049742993]
We propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2).
For a 2,048x1,024 input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce 1080 Ti card, which is significantly faster than existing methods, yet we achieve better segmentation accuracy.
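The bilateral design pairs a shallow, wide detail branch with a deep, narrow semantic branch. The toy sketch below illustrates that layout, with a simple gated fusion standing in for the paper's Guided Aggregation Layer; channel widths and depths are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class BilateralSketch(nn.Module):
    """Toy two-branch layout: the detail branch keeps spatial precision,
    the semantic branch captures context, and a sigmoid gate fuses them."""
    def __init__(self, num_classes=19):
        super().__init__()
        self.detail = nn.Sequential(                    # stride 8, wide
            conv_bn_relu(3, 64, 2), conv_bn_relu(64, 64, 2),
            conv_bn_relu(64, 128, 2))
        self.semantic = nn.Sequential(                  # stride 32, narrow
            conv_bn_relu(3, 16, 2), conv_bn_relu(16, 32, 2),
            conv_bn_relu(32, 64, 2), conv_bn_relu(64, 128, 2),
            conv_bn_relu(128, 128, 2))
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        d = self.detail(x)
        s = F.interpolate(self.semantic(x), size=d.shape[2:],
                          mode="bilinear", align_corners=False)
        fused = d * torch.sigmoid(s) + s                # simple guided fusion
        logits = self.head(fused)
        return F.interpolate(logits, size=x.shape[2:],
                             mode="bilinear", align_corners=False)
```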
arXiv Detail & Related papers (2020-04-05T10:26:38Z)