Tamed Warping Network for High-Resolution Semantic Video Segmentation
- URL: http://arxiv.org/abs/2005.01344v4
- Date: Tue, 11 Jul 2023 08:54:31 GMT
- Title: Tamed Warping Network for High-Resolution Semantic Video Segmentation
- Authors: Songyuan Li, Junyi Feng, and Xi Li
- Abstract summary: We build a non-key-frame CNN, fusing warped context features with current spatial details.
Based on the feature fusion, our Context Feature Rectification (CFR) module learns the model's difference from a per-frame model to correct the warped features.
Our Residual-Guided Attention (RGA) module utilizes the residual maps in the compressed domain to help CFR focus on error-prone regions.
- Score: 14.553335231691877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent approaches for fast semantic video segmentation have reduced
redundancy by warping feature maps across adjacent frames, greatly speeding up
the inference phase. However, the accuracy drops seriously owing to the errors
incurred by warping. In this paper, we propose a novel framework and design a
simple and effective correction stage after warping. Specifically, we build a
non-key-frame CNN, fusing warped context features with current spatial details.
Based on the feature fusion, our Context Feature Rectification (CFR) module
learns the model's difference from a per-frame model to correct the warped
features. Furthermore, our Residual-Guided Attention (RGA) module utilizes the
residual maps in the compressed domain to help CFR focus on error-prone
regions. Results on Cityscapes show that the accuracy significantly increases
from $67.3\%$ to $71.6\%$, and the speed edges down from $65.5$ FPS to $61.8$
FPS at a resolution of $1024\times 2048$. For non-rigid categories, e.g.,
"human" and "object", the improvements are even higher than 18 percentage
points.
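The warp-then-correct idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`warp_features`, `residual_attention`, `rectify`), the integer motion field, and the `correction` tensor are all hypothetical stand-ins. In the paper the correction is learned by a CNN and the attention is derived from compressed-domain residual maps. Here the correction is simply passed in, and the attention is assumed to be the normalized residual magnitude.

```python
import numpy as np


def warp_features(prev_feat, motion):
    """Backward-warp features of shape (C, H, W) with an integer motion field (2, H, W).

    Assumption: motion[0] holds dy and motion[1] holds dx, so that
    out[:, y, x] = prev_feat[:, y + dy, x + dx], with indices clamped to the border.
    """
    _, h, w = prev_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(ys + motion[0], 0, h - 1)
    src_x = np.clip(xs + motion[1], 0, w - 1)
    return prev_feat[:, src_y, src_x]


def residual_attention(residual):
    """Map a residual map (H, W) to weights in [0, 1].

    Assumption: larger residual magnitude marks a more error-prone region.
    """
    mag = np.abs(residual).astype(float)
    peak = mag.max()
    return mag / peak if peak > 0 else np.zeros_like(mag)


def rectify(warped, correction, attention):
    """Add a correction term to warped features, weighted per pixel by attention."""
    return warped + attention[None, :, :] * correction


# Toy usage with random data (shapes only; no trained model involved).
rng = np.random.default_rng(0)
prev_feat = rng.normal(size=(8, 16, 32))           # key-frame features
motion = rng.integers(-2, 3, size=(2, 16, 32))     # hypothetical motion field
residual = rng.normal(size=(16, 32))               # hypothetical residual map
correction = rng.normal(size=(8, 16, 32))          # stand-in for a learned correction

warped = warp_features(prev_feat, motion)
rectified = rectify(warped, correction, residual_attention(residual))
```

Where the residual is zero the warped features pass through unchanged, so the correction is concentrated on regions where warping is likely to have failed.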
Related papers
- PyNeRF: Pyramidal Neural Radiance Fields [51.25406129834537]
We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions.
At render time, we simply use coarser grids to render samples that cover larger volumes.
Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.
arXiv Detail & Related papers (2023-11-30T23:52:46Z)
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Global Context Aggregation Network for Lightweight Saliency Detection of Surface Defects [70.48554424894728]
We develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects on the encoder-decoder structure.
First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module.
Experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency than 17 other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-22T06:19:11Z)
- Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models [96.76758318732308]
We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
arXiv Detail & Related papers (2023-04-02T19:08:02Z)
- Stage-Aware Feature Alignment Network for Real-Time Semantic Segmentation of Street Scenes [59.81228011432776]
We present a novel Stage-aware Feature Alignment Network (SFANet) for real-time semantic segmentation of street scenes.
By taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder.
Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes.
arXiv Detail & Related papers (2022-03-08T11:46:41Z)
- PlaneSegNet: Fast and Robust Plane Estimation Using a Single-stage Instance Segmentation CNN [12.251947429149796]
We propose a real-time deep neural architecture that estimates piece-wise planar regions from a single RGB image.
Our method achieves significantly higher frame rates than two-stage methods, with comparable segmentation accuracy.
arXiv Detail & Related papers (2021-03-29T08:53:05Z)
- Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes [0.23090185577016442]
We propose novel deep dual-resolution networks (DDRNets) for real-time semantic segmentation of road scenes.
Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets.
arXiv Detail & Related papers (2021-01-15T12:56:18Z)
- A Backbone Replaceable Fine-tuning Framework for Stable Face Alignment [21.696696531924374]
We propose a Jitter loss function that leverages temporal information to suppress inaccurate as well as jittered landmarks.
The proposed framework achieves at least 40% improvement on stability evaluation metrics.
It can swiftly convert a landmark detector for facial images to a better-performing one for videos without retraining the entire model.
arXiv Detail & Related papers (2020-10-19T13:40:39Z)
- Recurrent Feature Reasoning for Image Inpainting [110.24760191732905]
The Recurrent Feature Reasoning (RFR) network is mainly constructed from a plug-and-play Recurrent Feature Reasoning module and a Knowledge Consistent Attention (KCA) module.
The RFR module recurrently infers the hole boundaries of the convolutional feature maps and then uses them as clues for further inference.
To capture information from distant locations in the feature map, we further develop KCA and incorporate it into RFR.
arXiv Detail & Related papers (2020-08-09T14:40:04Z)
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692]
We propose a novel architecture for semantic segmentation of high-resolution images and videos in real-time.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism.
Results on multiple datasets demonstrate better accuracy and speed than existing approaches.
arXiv Detail & Related papers (2020-07-07T22:37:16Z)