Efficient Video Semantic Segmentation with Labels Propagation and Refinement
- URL: http://arxiv.org/abs/1912.11844v1
- Date: Thu, 26 Dec 2019 11:45:15 GMT
- Title: Efficient Video Semantic Segmentation with Labels Propagation and Refinement
- Authors: Matthieu Paul, Christoph Mayer, Luc Van Gool, Radu Timofte
- Abstract summary: This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
- Score: 138.55845680523908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines:
(i) On the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. It runs in parallel with the GPU.
(ii) On the GPU, two Convolutional Neural Networks: a main segmentation network that predicts dense semantic labels from scratch, and a Refiner that improves the predictions from previous frames with the help of a fast Inconsistencies Attention Module (IAM). The latter identifies regions that cannot be propagated accurately.
We suggest several operating points depending on the desired frame rate and accuracy. Our pipeline achieves accuracy levels competitive with existing real-time methods for semantic image segmentation (mIoU above 60%), while achieving much higher frame rates. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
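As an illustration of step (i) above, the sketch below shows how a label map can be carried from one frame to the next with dense optical flow and backward warping. It is a minimal sketch only: OpenCV's Farneback flow stands in for the paper's unspecified "very fast" CPU flow method, and the function name and parameters are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def propagate_labels(prev_frame, next_frame, prev_labels):
    """Warp the previous frame's label map onto the next frame (sketch)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Estimate flow from the NEXT frame back to the PREVIOUS one, so each
    # pixel of the next frame knows where to fetch its label (backward warp).
    flow = cv2.calcOpticalFlowFarneback(
        next_gray, prev_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w = prev_labels.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Nearest-neighbour sampling keeps the class ids discrete.
    return cv2.remap(prev_labels, map_x, map_y,
                     interpolation=cv2.INTER_NEAREST,
                     borderMode=cv2.BORDER_REPLICATE)
```

Because the flow runs on the CPU, a real pipeline would compute it for frame t+1 while the GPU is still busy with frame t, which is what allows the propagation path to reach the high frame rates quoted above.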
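The abstract describes the Inconsistencies Attention Module only at a high level. One classical way to flag pixels that cannot be propagated reliably, given here purely as an assumed illustration and not as the paper's IAM, is a forward-backward flow consistency check: where following the flow out and back does not return to the starting pixel, the propagated label is distrusted. Reusing the imports above:

```python
def inconsistency_mask(flow_fwd, flow_bwd, tol=1.5):
    """Flag pixels whose forward/backward flows disagree (illustrative only).

    flow_fwd: flow from frame t to t+1; flow_bwd: flow from t+1 back to t.
    Both have shape (H, W, 2). Returns a boolean mask over frame t+1.
    """
    h, w = flow_bwd.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Follow the backward flow to frame t and sample the forward flow there.
    map_x = (grid_x + flow_bwd[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_bwd[..., 1]).astype(np.float32)
    fwd_at_source = cv2.remap(flow_fwd, map_x, map_y, cv2.INTER_LINEAR)
    # A consistent round trip should cancel out: flow_bwd + flow_fwd ~ 0.
    roundtrip_error = np.linalg.norm(flow_bwd + fwd_at_source, axis=-1)
    return roundtrip_error > tol
```

In a pipeline like EVS, the pixels flagged by such a mask would be the ones handed to the GPU-side Refiner, while consistent regions keep their cheaply propagated labels.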
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626] (2024-10-08)
  Current image-to-video diffusion models, while powerful at generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames respectively, ensuring more coherent and on-manifold generation of intermediate frames.
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854] (2024-05-07)
  Video object segmentation (VOS) networks typically use memory-based methods, which suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames, and 2) pixel-level matching leads to undesired mismatching. In this paper, we propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame.
- Cross-CBAM: A Lightweight network for Scene Segmentation [2.064612766965483] (2023-06-04)
  We present the Cross-CBAM network, a novel lightweight network for real-time semantic segmentation. In experiments on the Cityscapes and CamVid datasets, we achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX 1080Ti.
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353] (2022-01-12)
  We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip. Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
- Adaptive Focus for Efficient Video Recognition [29.615394426035074] (2021-05-07)
  We propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel and is efficient on modern GPU devices.
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483] (2020-12-21)
  Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame. We exploit this observation by using temporal information to quickly identify frames with minimal change. We propose a novel dynamic network that estimates change across frames and decides which path -- computing the full network or reusing the previous frame's features -- to take.
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692] (2020-07-07)
  We propose a novel architecture for semantic segmentation of high-resolution images and videos in real time. The proposed architecture relies on our fast spatial attention, a simple yet efficient modification of the popular self-attention mechanism. Results on multiple datasets demonstrate superior accuracy and speed compared to existing approaches.
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425] (2020-04-03)
  TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963] (2020-02-26)
  In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference. We employ compact models for real-time execution and design new knowledge distillation methods to narrow the performance gap between compact models and large models.
This list is automatically generated from the titles and abstracts of the papers on this site.