GSVNet: Guided Spatially-Varying Convolution for Fast Semantic
Segmentation on Video
- URL: http://arxiv.org/abs/2103.08834v1
- Date: Tue, 16 Mar 2021 03:38:59 GMT
- Title: GSVNet: Guided Spatially-Varying Convolution for Fast Semantic
Segmentation on Video
- Authors: Shih-Po Lee, Si-Cun Chen, Wen-Hsiao Peng
- Abstract summary: We propose a simple yet efficient propagation framework for video segmentation.
We perform lightweight flow estimation in 1/8-downscaled image space for temporal warping in segmentation output space.
We introduce a guided spatially-varying convolution for fusing segmentations derived from the previous and current frames, to mitigate propagation error.
- Score: 10.19019476978683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses fast semantic segmentation on video. Video segmentation
often calls for real-time, or even faster-than-real-time, processing. One common
recipe for conserving the computation arising from feature extraction is to
propagate features of a few selected keyframes. However, recent advances in fast
image segmentation make these solutions less attractive. To leverage fast image
segmentation for furthering video segmentation, we propose a simple yet
efficient propagation framework. Specifically, we perform lightweight flow
estimation in 1/8-downscaled image space for temporal warping in segmentation
output space. Moreover, we introduce a guided spatially-varying convolution
for fusing segmentations derived from the previous and current frames, to
mitigate propagation error and enable lightweight feature extraction on
non-keyframes. Experimental results on Cityscapes and CamVid show that our
scheme achieves the state-of-the-art accuracy-throughput trade-off on video
segmentation.
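The propagation scheme described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative PyTorch version of the two ingredients: warping the previous frame's segmentation logits with an upsampled low-resolution flow, and fusing warped and current logits with per-pixel kernels predicted from a guidance input. All names, shapes, and design choices (kernel size, softmax normalization) are assumptions for illustration, not the authors' released implementation.

```python
# Minimal, illustrative sketch of flow-based warping plus a guided
# spatially-varying fusion; details are assumptions, not GSVNet's code.
import torch
import torch.nn.functional as F


def warp_with_flow(logits, flow):
    """Warp previous-frame segmentation logits to the current frame.

    logits: (B, C, H, W) segmentation outputs of the previous frame.
    flow:   (B, 2, H, W) backward flow in pixels, i.e. the displacement
            from each current-frame pixel to its location in the previous
            frame (estimated at 1/8 scale and upsampled, per the abstract).
    """
    b, _, h, w = logits.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=logits.dtype, device=logits.device),
        torch.arange(w, dtype=logits.dtype, device=logits.device),
        indexing="ij",
    )
    # Displace the sampling grid by the flow and normalize to [-1, 1].
    x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x, y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(logits, grid, mode="bilinear", align_corners=True)


class GuidedSpatiallyVaryingConv(torch.nn.Module):
    """Fuse warped previous-frame logits with current-frame logits using
    per-pixel kernels predicted from a guidance input (e.g. the current
    frame). A generic reading of 'guided spatially-varying convolution';
    the paper's exact design may differ."""

    def __init__(self, guide_channels=3, k=3):
        super().__init__()
        self.k = k
        # Predict a k*k kernel per pixel for each of the two inputs.
        self.kernel_net = torch.nn.Conv2d(guide_channels, 2 * k * k, 3, padding=1)

    def forward(self, warped_prev, current, guide):
        b, c, h, w = current.shape
        k2 = self.k * self.k
        kernels = self.kernel_net(guide)  # (B, 2*k*k, H, W)
        # Jointly normalize both kernels so the fusion is a convex combination.
        kernels = F.softmax(kernels.view(b, 2 * k2, -1), dim=1).view(b, 2 * k2, h, w)

        def apply_kernels(x, ker):
            # Gather k*k neighborhoods and weight them per pixel.
            patches = F.unfold(x, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
            patches = patches.view(b, c, k2, h, w)
            return (patches * ker.unsqueeze(1)).sum(dim=2)  # (B, C, H, W)

        return apply_kernels(warped_prev, kernels[:, :k2]) + \
               apply_kernels(current, kernels[:, k2:])
```

On non-keyframes, the fused logits would stand in for a full backbone pass; the actual GSVNet design should be taken from the paper and its official code.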
Related papers
- A Simple Video Segmenter by Tracking Objects Along Axial Trajectories [30.272535124699164]
Video segmentation requires consistently segmenting and tracking objects over time.
Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges.
We present Axial-VS, a framework that enhances video segmenters by tracking objects along axial trajectories.
arXiv Detail & Related papers (2023-11-30T13:20:09Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frames.
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - End-to-End Compressed Video Representation Learning for Generic Event
Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning method for event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs (groups of pictures).
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
arXiv Detail & Related papers (2022-03-29T08:27:48Z) - Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
With STM with top-k filtering as our base model, we achieved highly competitive results on DAVIS16 and YouTube-VOS with substantial speedups of up to 4.9X with little loss in accuracy.
arXiv Detail & Related papers (2021-07-26T12:57:04Z) - Local Memory Attention for Fast Video Semantic Segmentation [157.7618884769969]
We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline.
Our approach aggregates a rich representation of the semantic information in past frames into a memory module.
We observe improvements in segmentation performance on Cityscapes of 1.7% and 2.1% in mIoU, while increasing the inference time of ERFNet by only 1.5 ms.
arXiv Detail & Related papers (2021-01-05T18:57:09Z) - Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
arXiv Detail & Related papers (2020-04-03T22:43:32Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high-definition videos using a hybrid GPU/CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline in which, on the CPU, a very fast optical flow method exploits the temporal aspect of the video and propagates semantic information from one frame to the next (a minimal sketch of this keyframe-propagation pattern follows this list).
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
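Several of the papers above (the compressed-video VOS method, the EVS pipeline, and GSVNet itself) share a keyframe-propagation pattern: run the expensive network on sparse keyframes and carry labels forward on the remaining frames. Below is a hedged, self-contained Python sketch of that pattern using OpenCV's Farneback flow; segment_fn, the keyframe interval, and the flow parameters are placeholders, not any paper's actual configuration.

```python
# Hedged sketch: full inference on sparse keyframes, flow-based label
# propagation on the rest. Not taken from any specific paper's code.
import numpy as np
import cv2


def segment_video(frames, segment_fn, keyframe_interval=5):
    """frames: iterable of BGR uint8 images; segment_fn: frame -> uint8
    label map of the same spatial size (the expensive CNN)."""
    prev_gray, labels, outputs = None, None, []
    for t, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if t % keyframe_interval == 0 or labels is None:
            labels = segment_fn(frame)  # full segmentation on keyframes
        else:
            # Backward flow: displacement from current-frame pixels to the
            # previous frame, so we can sample the previous label map.
            flow = cv2.calcOpticalFlowFarneback(
                gray, prev_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            h, w = gray.shape
            xs, ys = np.meshgrid(np.arange(w), np.arange(h))
            map_x = (xs + flow[..., 0]).astype(np.float32)
            map_y = (ys + flow[..., 1]).astype(np.float32)
            labels = cv2.remap(labels, map_x, map_y, cv2.INTER_NEAREST)
        prev_gray = gray
        outputs.append(labels)
    return outputs
```

Methods like GSVNet improve on this baseline by correcting propagation error (here, warped labels are used as-is) with a learned fusion against the current frame.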