Efficient Video Semantic Segmentation with Labels Propagation and Refinement
- URL: http://arxiv.org/abs/1912.11844v1
- Date: Thu, 26 Dec 2019 11:45:15 GMT
- Title: Efficient Video Semantic Segmentation with Labels Propagation and Refinement
- Authors: Matthieu Paul, Christoph Mayer, Luc Van Gool, Radu Timofte
- Abstract summary: This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
- Score: 138.55845680523908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines:
(i) On the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. It runs in parallel with the GPU.
(ii) On the GPU, two Convolutional Neural Networks: a main segmentation network that predicts dense semantic labels from scratch, and a Refiner that improves the predictions from previous frames with the help of a fast Inconsistencies Attention Module (IAM). The latter identifies regions that cannot be propagated accurately.
We suggest several operating points depending on the desired frame rate and accuracy. Our pipeline achieves accuracy levels competitive with existing real-time methods for semantic image segmentation (mIoU above 60%), while achieving much higher frame rates. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
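As an illustration of step (i) above, the sketch below shows how a label map can be carried from one frame to the next with dense optical flow and backward warping. It is a minimal sketch only: OpenCV's Farneback flow stands in for the paper's unspecified "very fast" CPU flow method, and the function name and parameters are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def propagate_labels(prev_frame, next_frame, prev_labels):
    """Warp the previous frame's label map onto the next frame (sketch)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Estimate flow from the NEXT frame back to the PREVIOUS one, so each
    # pixel of the next frame knows where to fetch its label (backward warp).
    flow = cv2.calcOpticalFlowFarneback(
        next_gray, prev_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w = prev_labels.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Nearest-neighbour sampling keeps the class ids discrete.
    return cv2.remap(prev_labels, map_x, map_y,
                     interpolation=cv2.INTER_NEAREST,
                     borderMode=cv2.BORDER_REPLICATE)
```

Because the flow runs on the CPU, a real pipeline would compute it for frame t+1 while the GPU is still busy with frame t, which is what allows the propagation path to reach the high frame rates quoted above.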
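The abstract describes the Inconsistencies Attention Module only at a high level. One classical way to flag pixels that cannot be propagated reliably, given here purely as an assumed illustration and not as the paper's IAM, is a forward-backward flow consistency check: where following the flow out and back does not return to the starting pixel, the propagated label is distrusted. Reusing the imports above:

```python
def inconsistency_mask(flow_fwd, flow_bwd, tol=1.5):
    """Flag pixels whose forward/backward flows disagree (illustrative only).

    flow_fwd: flow from frame t to t+1; flow_bwd: flow from t+1 back to t.
    Both have shape (H, W, 2). Returns a boolean mask over frame t+1.
    """
    h, w = flow_bwd.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Follow the backward flow to frame t and sample the forward flow there.
    map_x = (grid_x + flow_bwd[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_bwd[..., 1]).astype(np.float32)
    fwd_at_source = cv2.remap(flow_fwd, map_x, map_y, cv2.INTER_LINEAR)
    # A consistent round trip should cancel out: flow_bwd + flow_fwd ~ 0.
    roundtrip_error = np.linalg.norm(flow_bwd + fwd_at_source, axis=-1)
    return roundtrip_error > tol
```

In a pipeline like EVS, the pixels flagged by such a mask would be the ones handed to the GPU-side Refiner, while consistent regions keep their cheaply propagated labels.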
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626] (2024-10-08)
  Current image-to-video diffusion models, while powerful at generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames respectively, ensuring more coherent and on-manifold generation of intermediate frames.
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854] (2024-05-07)
  Video object segmentation (VOS) networks typically use memory-based methods, which suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames, and 2) pixel-level matching leads to undesired mismatching. In this paper, we propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame.
- Cross-CBAM: A Lightweight network for Scene Segmentation [2.064612766965483] (2023-06-04)
  We present the Cross-CBAM network, a novel lightweight network for real-time semantic segmentation. In experiments on the Cityscapes and CamVid datasets, we achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX 1080Ti.
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353] (2022-01-12)
  We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip. Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
- Adaptive Focus for Efficient Video Recognition [29.615394426035074] (2021-05-07)
  We propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel and is efficient on modern GPU devices.
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483] (2020-12-21)
  Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame. We exploit this observation by using temporal information to quickly identify frames with minimal change. We propose a novel dynamic network that estimates change across frames and decides which path -- computing the full network or reusing the previous frame's features -- to take.
- Real-time Semantic Segmentation with Fast Attention [94.88466483540692] (2020-07-07)
  We propose a novel architecture for semantic segmentation of high-resolution images and videos in real time. The proposed architecture relies on our fast spatial attention, a simple yet efficient modification of the popular self-attention mechanism. Results on multiple datasets demonstrate superior accuracy and speed compared to existing approaches.
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425] (2020-04-03)
  TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963] (2020-02-26)
  In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference. We employ compact models for real-time execution and design new knowledge distillation methods to narrow the performance gap between compact models and large models.
This list is automatically generated from the titles and abstracts of the papers on this site.