Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation
- URL: http://arxiv.org/abs/2206.09604v1
- Date: Mon, 20 Jun 2022 07:20:02 GMT
- Title: Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation
- Authors: Hyunsu Rhee, Dongchan Min, Sunil Hwang, Bruno Andreis, Sung Ju Hwang
- Abstract summary: We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frames.
- Score: 49.17930380106643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time video segmentation is a crucial task for many real-world
applications such as autonomous driving and robot control. Since
state-of-the-art semantic segmentation models are often too heavy for real-time
applications despite their impressive performance, researchers have proposed
lightweight architectures with speed-accuracy trade-offs, achieving real-time
speed at the expense of reduced accuracy. In this paper, we propose a novel
framework to speed up any architecture with skip-connections for real-time
vision tasks by exploiting the temporal locality in videos. Specifically, at
the arrival of each frame, we transform the features from the previous frame to
reuse them at specific spatial bins. We then perform partial computation of the
backbone network on the regions of the current frame that capture temporal
differences between the current and previous frames. This is done by dynamically
dropping residual blocks using a gating mechanism that decides which
blocks to drop based on inter-frame distortion. We validate our
Spatial-Temporal Mask Generator (STMG) on video semantic segmentation
benchmarks with multiple backbone networks, and show that our method largely
speeds up inference with minimal loss of accuracy.
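The abstract only sketches the mechanism, but the gating idea is concrete enough to illustrate. Below is a minimal PyTorch sketch of distortion-aware block gating with feature reuse; the class names, the fixed distortion threshold, and the whole-feature cache are our simplifications (the paper's STMG gates per spatial bin and transforms the cached features rather than reusing them verbatim).

```python
import torch
import torch.nn as nn

class DistortionGate(nn.Module):
    """Decides whether a block must be recomputed for the new frame,
    using mean absolute inter-frame difference as a distortion proxy."""
    def __init__(self, threshold: float = 0.1):
        super().__init__()
        self.threshold = threshold  # fixed here; learned in practice

    def forward(self, curr_frame, prev_frame):
        distortion = (curr_frame - prev_frame).abs().mean()
        return bool(distortion > self.threshold)

class GatedResidualBlock(nn.Module):
    """Residual block that can be skipped entirely, returning the
    cached output it produced for the previous frame instead."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.cache = None  # features computed for the previous frame

    def forward(self, x, recompute: bool):
        if recompute or self.cache is None:
            out = x + self.body(x)
            self.cache = out.detach()  # stash for possible reuse
            return out
        return self.cache  # skip the block, reuse previous features
```

In the actual method the gate is trained jointly with the backbone, so skipped blocks cost essentially no compute at inference.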
Related papers
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, computationally heavy, and unsuitable for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we introduce two new datasets that emulate real-world video-call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
- Representation Recycling for Streaming Video Analysis [19.068248496174903]
StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
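StreamDEQ builds on deep equilibrium models, whose output is a fixed point of an implicit layer. The recycling trick, as we read it, is to warm-start the fixed-point solve for frame t from the representation of frame t-1. A hypothetical sketch follows; the function names and the plain fixed-point iteration are ours, not the paper's solver.

```python
import torch

def stream_deq_step(f, x_t, z_prev, n_iters: int = 2):
    """One streaming step of an equilibrium model: refine the previous
    frame's representation z_prev toward the fixed point z = f(z, x_t).

    Because consecutive frames are similar, a warm start from z_prev
    usually needs only a few iterations instead of a full solve."""
    z = z_prev
    with torch.no_grad():
        for _ in range(n_iters):
            z = f(z, x_t)  # naive fixed-point iteration
    return z

# Usage on a stream: carry z across frames, full solve only at t = 0.
# z = full_deq_solve(f, x_0)   # hypothetical full solver
# for x_t in frames[1:]:
#     z = stream_deq_step(f, x_t, z)
```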
arXiv Detail & Related papers (2022-04-28T13:35:14Z)
- Borrowing from yourself: Faster future video segmentation with partial channel update [0.0]
We propose to tackle the task of fast future video segmentation prediction through the use of convolutional layers with time-dependent channel masking.
This technique updates only a chosen subset of the feature maps at each time-step, reducing both computation and latency.
We apply this technique to several fast architectures and experimentally confirm its benefits for the future prediction subtask.
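A rough PyTorch sketch of time-dependent channel masking under our own assumptions: a rotating channel schedule and a conv layer that recomputes only the selected filters, carrying the remaining channels over from the previous step. The rotation policy and update fraction are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialChannelUpdate(nn.Module):
    """Convolution whose output channels are refreshed in rotating
    subsets over time; unrefreshed channels are carried over."""
    def __init__(self, channels: int, update_fraction: float = 0.25):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.k = max(1, int(channels * update_fraction))
        self.channels = channels
        self.state = None  # cached output from the previous time-step

    def forward(self, x, t):
        if self.state is None:
            self.state = self.conv(x)  # first frame: full computation
            return self.state
        # Time-dependent mask: rotate which channels get recomputed.
        idx = (torch.arange(self.k) + t * self.k) % self.channels
        # Convolve with only the selected filters (less compute).
        out = F.conv2d(x, self.conv.weight[idx], self.conv.bias[idx],
                       padding=1)
        state = self.state.clone()
        state[:, idx] = out  # refresh the chosen subset only
        self.state = state
        return state
```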
arXiv Detail & Related papers (2022-02-11T16:37:53Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important cues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over the long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483]
Current approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
Since consecutive frames often change little, we exploit this temporal redundancy to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path -- computing a full network or reusing previous frame's feature -- to choose.
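In outline, the decision loop might look like the hypothetical sketch below: a gate scores inter-frame change and routes each frame to a heavy or a cheap path. The paper's gate is learned end-to-end; our fixed threshold and the helper names (`ReuseGate`, `segment_stream`) are only for illustration.

```python
import torch
import torch.nn as nn

class ReuseGate(nn.Module):
    """Scores inter-frame change; True means 'recompute the features'."""
    def __init__(self, tau: float = 0.05):
        super().__init__()
        self.tau = tau  # fixed threshold stands in for a learned gate

    def forward(self, curr, prev):
        return bool((curr - prev).abs().mean() > self.tau)

def segment_stream(frames, backbone, head, gate):
    """Route each frame through the heavy backbone only when the gate
    fires; otherwise reuse the previous frame's features."""
    feats, prev, masks = None, None, []
    for frame in frames:
        if prev is None or gate(frame, prev):
            feats = backbone(frame)  # heavy path: full network
        # cheap path: keep `feats` from the previous frame as-is
        masks.append(head(feats))
        prev = frame
    return masks
```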
arXiv Detail & Related papers (2020-12-21T19:40:17Z)
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
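One way to picture this: shallow sub-networks take turns processing frames, and their cached outputs are composed into a full-strength feature. A hedged sketch under our assumptions; TDNet composes features with an attention module, whereas the simple sum and class name here are our placeholders.

```python
import torch
import torch.nn as nn

class TemporallyDistributed(nn.Module):
    """Cycles shallow sub-networks over frames and composes their
    cached outputs to approximate one deep backbone's features."""
    def __init__(self, subnets):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)
        self.cache = [None] * len(subnets)

    def forward(self, frame, t):
        i = t % len(self.subnets)           # one sub-network per frame
        self.cache[i] = self.subnets[i](frame)
        feats = [f for f in self.cache if f is not None]
        return torch.stack(feats).sum(0)    # placeholder composition
```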
arXiv Detail & Related papers (2020-04-03T22:43:32Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform semantic video segmentation efficiently in a per-frame fashion during inference.
We employ compact models for real-time execution, and design new knowledge distillation methods to narrow the performance gap between compact and large models.
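The abstract doesn't specify the new distillation methods, but a generic per-frame soft-target distillation loss, as commonly used for segmentation, looks like the sketch below; the temperature and weighting are illustrative defaults, not the paper's.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 T: float = 2.0, alpha: float = 0.5):
    """Per-frame segmentation distillation: KL between softened
    teacher/student class distributions plus standard cross-entropy.

    Logits: (N, C, H, W); labels: (N, H, W) integer class map."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```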
arXiv Detail & Related papers (2020-02-26T12:24:32Z)