TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- URL: http://arxiv.org/abs/2003.13260v3
- Date: Tue, 18 Aug 2020 06:52:41 GMT
- Title: TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- Authors: Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, and
Haibin Ling
- Abstract summary: Real-time semantic video segmentation is a challenging task due to the strict requirements of inference speed.
Recent approaches mainly devote great efforts to reducing the model size for high efficiency.
We propose a simple and effective framework, dubbed TapLab, to tap into resources from the compressed domain.
- Score: 161.4188504786512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time semantic video segmentation is a challenging task due to the strict
requirements of inference speed. Recent approaches mainly devote great efforts
to reducing the model size for high efficiency. In this paper, we rethink this
problem from a different viewpoint: using knowledge contained in compressed
videos. We propose a simple and effective framework, dubbed TapLab, to tap into
resources from the compressed domain. Specifically, we design a fast feature
warping module using motion vectors for acceleration. To reduce the noise
introduced by motion vectors, we design a residual-guided correction module and
a residual-guided frame selection module using residuals. TapLab significantly
reduces redundant computations of the state-of-the-art fast semantic image
segmentation models, running 3 to 10 times faster with controllable accuracy
degradation. The experimental results show that TapLab achieves 70.6% mIoU on
the Cityscapes dataset at 99.8 FPS with a single GPU card for the 1024x2048
videos. A high-speed version even reaches the speed of 160+ FPS. Codes will be
available soon at https://github.com/Sixkplus/TapLab.
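
The abstract describes two compressed-domain mechanisms: warping the previous frame's feature map with the motion vectors already stored in the bitstream, and using the residual signal to decide when a full forward pass is actually needed. The snippet below is a minimal, hypothetical PyTorch sketch of those two ideas, not the authors' released implementation; `segmentation_net`, the tensor layouts, and the threshold `tau` are illustrative assumptions.

```python
# Hypothetical sketch of motion-vector feature warping and residual-guided
# frame selection, loosely following the TapLab abstract (not the official code).
import torch
import torch.nn.functional as F

def warp_features(feat, mv):
    """Warp a feature map with per-pixel motion vectors.

    feat: (1, C, H, W) features of the last fully inferred frame.
    mv:   (1, 2, H, W) motion vectors in pixels (channel 0 = x, channel 1 = y),
          assumed to point from the current frame back to the previous one.
    """
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=feat.device),
        torch.arange(w, dtype=torch.float32, device=feat.device),
        indexing="ij",
    )
    # Sampling location in the previous frame for every current-frame pixel.
    x_src = xs + mv[0, 0]
    y_src = ys + mv[0, 1]
    # Normalize to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0), dim=-1
    ).unsqueeze(0)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

def segment_clip(frames, mvs, residuals, segmentation_net, tau=0.05):
    """Run the heavy model only when the residual energy is large."""
    feats = segmentation_net(frames[0:1])              # full inference on frame 0
    outputs = [feats]
    for t in range(1, len(frames)):
        if residuals[t].abs().mean() > tau:            # residual-guided selection
            feats = segmentation_net(frames[t:t + 1])  # refresh with full inference
        else:
            feats = warp_features(feats, mvs[t:t + 1]) # fast motion-vector warping
        outputs.append(feats)
    return outputs
```

Note that real codec motion vectors come per macroblock rather than per pixel, so an implementation would first upsample them to the feature resolution; the residual-guided correction module mentioned in the abstract, which refines poorly warped regions, is omitted here for brevity.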
Related papers
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized with a two-stream SlowFast design of inputs for Video LLMs, which aggregates features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z) - No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z) - DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z) - Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis [40.249030338644225]
Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps.
Fast-Vid2Vid achieves near real-time performance at 20 FPS and saves around 8x computational cost on a single V100 GPU.
arXiv Detail & Related papers (2022-07-11T17:57:57Z) - Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected frames and makes predictions for the other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
With STM with top-k filtering as the base model, the method achieves highly competitive results on DAVIS16 and YouTube-VOS, with speedups of up to 4.9x and little loss in accuracy.
arXiv Detail & Related papers (2021-07-26T12:57:04Z) - FastRIFE: Optimization of Real-Time Intermediate Flow Estimation for
Video Frame Interpolation [0.0]
This paper proposes the FastRIFE algorithm, a speed-optimized variant of the RIFE (Real-Time Intermediate Flow Estimation) model.
All source codes are available at https://gitlab.com/malwinq/interpolation-of-images-for-slow-motion-videos.
arXiv Detail & Related papers (2021-05-27T22:31:40Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next. (A rough sketch of this flow-based propagation step appears after the last entry in this list.)
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)