Representation Recycling for Streaming Video Analysis
- URL: http://arxiv.org/abs/2204.13492v4
- Date: Sat, 6 Jan 2024 23:30:03 GMT
- Title: Representation Recycling for Streaming Video Analysis
- Authors: Can Ufuk Ertenli, Ramazan Gokberk Cinbis, Emre Akbas
- Abstract summary: StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
- Score: 19.068248496174903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present StreamDEQ, a method that aims to infer frame-wise representations
on videos with minimal per-frame computation. Conventional deep networks do
feature extraction from scratch at each frame in the absence of ad-hoc
solutions. We instead aim to build streaming recognition models that can
natively exploit temporal smoothness between consecutive video frames. We
observe that the recently emerging implicit layer models provide a convenient
foundation to construct such models, as they define representations as the
fixed-points of shallow networks, which need to be estimated using iterative
methods. Our main insight is to distribute the inference iterations over the
temporal axis by using the most recent representation as a starting point at
each frame. This scheme effectively recycles the recent inference computations
and greatly reduces the needed processing time. Through extensive experimental
analysis, we show that StreamDEQ is able to recover near-optimal
representations in a few frames' time and maintain an up-to-date representation
throughout the video duration. Our experiments on video semantic segmentation,
video object detection, and human pose estimation in videos show that StreamDEQ
achieves on-par accuracy with the baseline while being more than 2-4x faster.
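As a rough illustration of this warm-starting scheme, here is a minimal sketch with a toy contractive DEQ cell, plain fixed-point iteration, and a synthetic drifting "video". This is not the authors' implementation; all names, constants, and the solver choice are illustrative (real DEQs typically use Anderson acceleration or Broyden's method):

```python
import numpy as np

# Toy DEQ cell: z* = f(z*, x) = tanh(Wz @ z + Wx @ x).
# Keeping the spectral norm of Wz well below 1 makes the map contractive,
# so plain fixed-point iteration converges.
rng = np.random.default_rng(0)
dim = 64
Wz = rng.normal(scale=0.1 / np.sqrt(dim), size=(dim, dim))  # contractive
Wx = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))

def f(z, x):
    return np.tanh(Wz @ z + Wx @ x)

def solve(x, z0, iters):
    z = z0
    for _ in range(iters):
        z = f(z, x)
    return z

# A synthetic "video": slowly drifting inputs mimic temporal smoothness.
frames = []
x = rng.normal(size=dim)
for _ in range(30):
    x = x + 0.02 * rng.normal(size=dim)
    frames.append(x.copy())

# StreamDEQ-style streaming inference: recycle the previous frame's
# representation as the solver's initialization and run only a few
# iterations per frame, instead of solving from scratch each time.
z = np.zeros(dim)
for t, x in enumerate(frames):
    z = solve(x, z0=z, iters=2)                    # cheap warm-started update
    z_ref = solve(x, z0=np.zeros(dim), iters=50)   # costly from-scratch reference
    if t % 10 == 0:
        print(f"frame {t:2d}: ||z - z_ref|| = {np.linalg.norm(z - z_ref):.4f}")
```

Running this shows the gap to the from-scratch reference shrinking over the first few frames and then staying small, which mirrors the paper's claim that near-optimal representations are recovered in a few frames' time.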
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
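The balanced cluster assignment mentioned in the SIGMA entry above is typically implemented with Sinkhorn-Knopp normalization. Below is a toy sketch of that step only (feature and prototype shapes, `eps`, and iteration count are illustrative assumptions, not SIGMA's actual configuration):

```python
import numpy as np

def sinkhorn(scores, n_iters=50, eps=0.05):
    """Balanced soft assignment of N features to K clusters.

    scores: (N, K) similarity logits. Returns a matrix Q whose rows sum
    to 1/N and columns sum to ~1/K, so every cluster receives an equal
    share of the features.
    """
    Q = np.exp((scores - scores.max()) / eps)  # subtract max for stability
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= K  # normalize columns
        Q /= Q.sum(axis=1, keepdims=True); Q /= N  # normalize rows
    return Q

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 32))    # e.g. space-time tube features
protos = rng.normal(size=(16, 32))    # learnable cluster prototypes
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
Q = sinkhorn(feats @ protos.T)
print(Q.sum(axis=0) * 256)  # per-cluster mass: ~uniform, each ≈ 256/16 = 16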
- DynPoint: Dynamic Neural Point For View Synthesis [45.44096876841621]
We propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos.
DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation.
Our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.
arXiv Detail & Related papers (2023-10-29T12:55:53Z)
- ResQ: Residual Quantization for Video Perception [18.491197847596283]
We propose a novel quantization scheme for video networks coined as Residual Quantization.
We extend our model to dynamically adjust the bit-width proportional to the amount of changes in the video.
arXiv Detail & Related papers (2023-08-18T12:41:10Z)
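The ResQ entry above describes two ideas that a short sketch can make concrete: quantize the residual between consecutive feature maps rather than the features themselves, and pick the bit-width from the magnitude of that change. The bit-selection policy, thresholds, and names below are hypothetical, not ResQ's actual scheme:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def pick_bits(residual, lo=2, hi=8):
    """Illustrative policy: more change in the video -> more bits."""
    change = np.abs(residual).mean()
    return int(np.clip(lo + change * 100, lo, hi))

rng = np.random.default_rng(0)
prev = rng.normal(size=(64, 64))
feats_prev_q = prev.copy()
for t in range(5):
    curr = prev + 0.05 * rng.normal(size=(64, 64))  # temporally smooth features
    residual = curr - feats_prev_q                  # quantize only the change
    bits = pick_bits(residual)
    feats_q = feats_prev_q + quantize(residual, bits)
    err = np.abs(feats_q - curr).mean()
    print(f"frame {t}: bits={bits}, reconstruction error={err:.4f}")
    feats_prev_q, prev = feats_q, curr
```

Because consecutive frames are similar, the residual has a much smaller dynamic range than the raw features, so few bits suffice for a low reconstruction error.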
- ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have high computational overhead, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we emulate real-world video call and streaming scenarios with two new datasets, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
- Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that captures temporal differences between the current and previous frame.
arXiv Detail & Related papers (2022-06-20T07:20:02Z)
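The partial-computation idea in the entry above can be sketched in a few lines: divide the frame into spatial bins, recompute a (stand-in) backbone only where the frame changed, and reuse the previous features elsewhere. The bin size, threshold, and `heavy_backbone` are hypothetical placeholders, not the paper's architecture:

```python
import numpy as np

def heavy_backbone(patch):
    """Stand-in for an expensive per-region backbone computation."""
    return np.tanh(patch * 2.0 + 0.5)

def update_features(curr, prev, prev_feats, bin_size=16, thresh=0.05):
    """Recompute features only in spatial bins where the frame changed."""
    feats = prev_feats.copy()
    H, W = curr.shape
    recomputed = 0
    for y in range(0, H, bin_size):
        for x in range(0, W, bin_size):
            sl = (slice(y, y + bin_size), slice(x, x + bin_size))
            # The temporal difference decides whether this bin needs fresh compute.
            if np.abs(curr[sl] - prev[sl]).mean() > thresh:
                feats[sl] = heavy_backbone(curr[sl])
                recomputed += 1
    total = (H // bin_size) * (W // bin_size)
    print(f"recomputed {recomputed}/{total} bins")
    return feats

rng = np.random.default_rng(0)
prev = rng.normal(size=(64, 64))
curr = prev.copy()
curr[16:32, 16:32] += 0.5   # motion confined to one region
feats = update_features(curr, prev, prev_feats=heavy_backbone(prev))
```

On this toy input, only the single moving bin is recomputed, which is the source of the speedup.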
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving temporal consistency without introducing overhead at inference time.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
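A minimal sketch of a temporal consistency loss in the spirit of the entry above: penalize disagreement between per-pixel class probabilities on consecutive frames. Real implementations typically warp frame t-1's prediction with optical flow before comparing so that only genuine label flicker is penalized; the flow step is omitted here and the shapes are illustrative:

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_consistency_loss(logits_t, logits_tm1):
    """Mean squared difference between consecutive per-pixel class
    probabilities (flow-based warping of frame t-1 omitted)."""
    return np.mean((softmax(logits_t) - softmax(logits_tm1)) ** 2)

rng = np.random.default_rng(0)
logits_tm1 = rng.normal(size=(32, 32, 19))  # H x W x classes (19 as in Cityscapes)
logits_t_stable = logits_tm1 + 0.01 * rng.normal(size=logits_tm1.shape)
logits_t_flicker = rng.normal(size=logits_tm1.shape)
print("stable: ", temporal_consistency_loss(logits_t_stable, logits_tm1))
print("flicker:", temporal_consistency_loss(logits_t_flicker, logits_tm1))
```

Temporally stable predictions incur a near-zero loss while flickering predictions are penalized, which is exactly the training signal such a loss adds on top of per-frame inference.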
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
arXiv Detail & Related papers (2020-04-03T22:43:32Z)
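An illustrative sketch of the TDNet idea summarized above: run a different shallow sub-network on each incoming frame and approximate the full deep feature by composing the most recent sub-network outputs. The toy sub-networks and the mean fusion are placeholders (TDNet itself uses an attention-based fusion module):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32
n_sub = 4
# Each sub-network stands in for a shallow feature extractor; together their
# outputs approximate what one deep network would compute on a single frame.
sub_weights = [rng.normal(scale=1 / np.sqrt(dim), size=(dim, dim))
               for _ in range(n_sub)]

def sub_network(i, frame):
    return np.tanh(sub_weights[i] @ frame)

def compose(sub_feats):
    # Placeholder fusion; TDNet uses attention here.
    return np.mean(sub_feats, axis=0)

frames = [rng.normal(size=dim) for _ in range(12)]
buffer = []  # rolling buffer of the last n_sub sub-network outputs
for t, frame in enumerate(frames):
    # Only ONE shallow sub-network runs per frame (cyclic assignment),
    # so the per-frame cost is roughly 1/n_sub of the full model.
    buffer.append(sub_network(t % n_sub, frame))
    buffer = buffer[-n_sub:]
    if len(buffer) == n_sub:
        feat = compose(buffer)
        print(f"frame {t}: fused feature norm = {np.linalg.norm(feat):.3f}")
```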
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution, and design new knowledge distillation methods to narrow the performance gap between compact and large models.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.