Semantic Flow for Fast and Accurate Scene Parsing
- URL: http://arxiv.org/abs/2002.10120v3
- Date: Mon, 29 Mar 2021 08:43:13 GMT
- Title: Semantic Flow for Fast and Accurate Scene Parsing
- Authors: Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan
Yang, Yunhai Tong
- Abstract summary: Flow Alignment Module (FAM) learns Semantic Flow between feature maps of adjacent levels.
Experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid.
Our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS.
- Score: 28.444273169423074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on designing an effective method for fast and
accurate scene parsing. A common practice to improve performance is to attain
high-resolution feature maps with strong semantic representation. Two widely
used strategies, atrous convolutions and feature pyramid fusion, are either
computationally intensive or ineffective. Inspired by optical flow for motion
alignment between adjacent video frames, we propose a Flow Alignment Module
(FAM) to learn Semantic Flow between feature maps of adjacent levels, and to
broadcast high-level features to high-resolution features effectively and
efficiently. Furthermore, integrating our module into a common feature pyramid
structure exhibits superior performance over other real-time methods, even on
lightweight backbone networks such as ResNet-18. Extensive experiments are
conducted on several challenging datasets, including Cityscapes, PASCAL
Context, ADE20K and CamVid. Notably, our network is the first to achieve
80.4% mIoU on Cityscapes at a frame rate of 26 FPS. The code is available at
https://github.com/lxtGH/SFSegNets.
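The core FAM idea is to predict a dense 2D offset field (the semantic flow) from a pair of adjacent-level feature maps, then bilinearly sample the coarse map at flow-displaced positions so that high-level semantics land on the right high-resolution pixels. Below is a minimal PyTorch sketch of this idea; the channel widths, the two 1x1 projection convolutions, and the flow-normalization convention are illustrative assumptions rather than the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlowAlignmentModule(nn.Module):
    """Sketch of a FAM-style block: predict semantic flow, warp coarse features."""

    def __init__(self, in_channels: int, mid_channels: int = 64):
        super().__init__()
        # Project both levels to a common width before predicting flow.
        self.down_l = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.down_h = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        # 2 output channels: a per-pixel (dx, dy) semantic flow field.
        self.flow_make = nn.Conv2d(mid_channels * 2, 2, kernel_size=3,
                                   padding=1, bias=False)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat: fine resolution; high_feat: coarse resolution (adjacent level).
        h, w = low_feat.shape[-2:]
        high_up = F.interpolate(self.down_h(high_feat), size=(h, w),
                                mode="bilinear", align_corners=False)
        flow = self.flow_make(torch.cat([self.down_l(low_feat), high_up], dim=1))
        return self.flow_warp(high_feat, flow)

    @staticmethod
    def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Sample `feat` at (base grid + flow), producing the flow's resolution.
        n, _, h, w = flow.shape
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device, dtype=feat.dtype)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device, dtype=feat.dtype)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        # Normalize the pixel-unit flow into the [-1, 1] sampling range.
        norm = torch.tensor([w, h], dtype=feat.dtype, device=feat.device)
        grid = grid + flow.permute(0, 2, 3, 1) / norm
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=False)


# Usage: warp a 32x64 coarse map up to a 64x128 fine map.
fam = FlowAlignmentModule(in_channels=256)
aligned = fam(torch.randn(1, 256, 64, 128),
              torch.randn(1, 256, 32, 64))  # -> (1, 256, 64, 128)
```

Compared with plain bilinear upsampling, the learned offsets let the module correct the spatial misalignment that accumulates across downsampling stages before the two levels are fused.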
Related papers
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- SFNet: Faster and Accurate Semantic Segmentation via Semantic Flow [88.97790684009979]
A common practice to improve the performance is to attain high-resolution feature maps with strong semantic representation.
We propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels.
We also present a novel Gated Dual Flow Alignment Module to directly align high-resolution feature maps and low-resolution feature maps.
arXiv Detail & Related papers (2022-07-10T08:25:47Z)
- Stage-Aware Feature Alignment Network for Real-Time Semantic Segmentation of Street Scenes [59.81228011432776]
We present a novel Stage-aware Feature Alignment Network (SFANet) for real-time semantic segmentation of street scenes.
By taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder.
Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes.
arXiv Detail & Related papers (2022-03-08T11:46:41Z)
- KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization [0.9507070656654633]
Real-time and online action localization in a video is a critical yet highly challenging problem.
Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow.
We propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions.
Our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
arXiv Detail & Related papers (2021-11-05T08:39:36Z)
- Optical-Flow-Reuse-Based Bidirectional Recurrent Network for Space-Time Video Super-Resolution [52.899234731501075]
Space-time video super-resolution (ST-VSR) simultaneously increases the spatial resolution and frame rate for a given video.
Existing methods typically suffer from difficulties in how to efficiently leverage information from a large range of neighboring frames.
We propose a coarse-to-fine bidirectional recurrent neural network instead of using ConvLSTM to leverage knowledge between adjacent frames.
arXiv Detail & Related papers (2021-10-13T15:21:30Z)
- Progressive Temporal Feature Alignment Network for Video Inpainting [51.26380898255555]
Video inpainting aims to fill spatio-temporal "corrupted" regions with plausible content.
Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution.
We propose 'Progressive Temporal Feature Alignment Network', which progressively enriches features extracted from the current frame with the warped feature from neighbouring frames.
arXiv Detail & Related papers (2021-04-08T04:50:33Z)
- AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing [12.409365458889082]
We propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics.
AttaNet consists of two primary modules: the Strip Attention Module (SAM) and the Attention Fusion Module (AFM).
arXiv Detail & Related papers (2021-03-10T08:38:29Z)
- Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 FPS and 39.3 FPS, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)
- FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP), a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 FPS on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)
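To make the CF-ASPP idea above concrete, here is a minimal PyTorch sketch, assuming "factorized" means depthwise-separable atrous convolutions and "cascaded" means two pyramid stages applied in sequence; the dilation rates and channel widths are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn


class FactorizedAtrousConv(nn.Module):
    """Depthwise atrous 3x3 followed by a pointwise 1x1 (separable conv)."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class CascadedFactorizedASPP(nn.Module):
    """Two factorized atrous pyramids in cascade, each fused by a 1x1 conv."""

    def __init__(self, channels: int, rates=(2, 4, 8)):
        super().__init__()
        self.stage1 = nn.ModuleList([FactorizedAtrousConv(channels, r) for r in rates])
        self.stage2 = nn.ModuleList([FactorizedAtrousConv(channels, r) for r in rates])
        self.fuse1 = nn.Conv2d(channels * len(rates), channels, 1, bias=False)
        self.fuse2 = nn.Conv2d(channels * len(rates), channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stage 1: parallel atrous branches over the input, fused to `channels`.
        y = self.fuse1(torch.cat([branch(x) for branch in self.stage1], dim=1))
        # Stage 2: a second pyramid over the fused features (the cascade).
        return self.fuse2(torch.cat([branch(y) for branch in self.stage2], dim=1))


# Usage: aggregate multi-scale context on a coarse feature map.
cf_aspp = CascadedFactorizedASPP(channels=128)
out = cf_aspp(torch.randn(1, 128, 32, 64))  # -> (1, 128, 32, 64)
```

The factorization keeps the parameter and FLOP count low relative to full 3x3 atrous convolutions, which is the property these real-time methods rely on.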