TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- URL: http://arxiv.org/abs/2003.13260v3
- Date: Tue, 18 Aug 2020 06:52:41 GMT
- Title: TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- Authors: Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, and
Haibin Ling
- Abstract summary: Real-time semantic video segmentation is a challenging task due to the strict requirements of inference speed.
Recent approaches mainly devote great efforts to reducing the model size for high efficiency.
We propose a simple and effective framework, dubbed TapLab, to tap into resources from the compressed domain.
- Score: 161.4188504786512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time semantic video segmentation is a challenging task due to the strict
requirements of inference speed. Recent approaches mainly devote great efforts
to reducing the model size for high efficiency. In this paper, we rethink this
problem from a different viewpoint: using knowledge contained in compressed
videos. We propose a simple and effective framework, dubbed TapLab, to tap into
resources from the compressed domain. Specifically, we design a fast feature
warping module using motion vectors for acceleration. To reduce the noise
introduced by motion vectors, we design a residual-guided correction module and
a residual-guided frame selection module using residuals. TapLab significantly
reduces redundant computations of the state-of-the-art fast semantic image
segmentation models, running 3 to 10 times faster with controllable accuracy
degradation. The experimental results show that TapLab achieves 70.6% mIoU on
the Cityscapes dataset at 99.8 FPS with a single GPU card for the 1024x2048
videos. A high-speed version even reaches the speed of 160+ FPS. Codes will be
available soon at https://github.com/Sixkplus/TapLab.
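A minimal sketch of the two compressed-domain ideas the abstract describes, written in Python/PyTorch. This is not the released TapLab implementation: the per-pixel motion-vector layout, the function names, and the residual threshold are assumptions made purely for illustration. The sketch warps the previous frame's segmentation features with motion vectors taken from the codec, and re-runs the full segmentation network whenever the residual energy is large (residual-guided frame selection).

```python
# Illustrative sketch only (not the authors' code). Assumes motion vectors have
# been upsampled from block level to a dense (B, 2, H, W) field in pixels, and
# that the codec residual is available as a tensor; the 0.05 threshold is arbitrary.
import torch
import torch.nn.functional as F


def warp_with_motion_vectors(prev_feat, mv):
    """Warp the previous frame's features to the current frame.

    prev_feat: (B, C, H, W) segmentation features of the previous frame.
    mv:        (B, 2, H, W) motion vectors (dx, dy) in pixels, pointing from
               each current-frame location to its reference location.
    """
    _, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_feat.device, dtype=prev_feat.dtype),
        torch.arange(w, device=prev_feat.device, dtype=prev_feat.dtype),
        indexing="ij",
    )
    # Sample each output location from its reference position in the previous frame.
    src_x = xs.unsqueeze(0) + mv[:, 0]
    src_y = ys.unsqueeze(0) + mv[:, 1]
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * src_x / (w - 1) - 1.0, 2.0 * src_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(prev_feat, grid, mode="nearest", align_corners=True)


def segment_frame(frame, prev_feat, mv, residual, seg_net, thresh=0.05):
    """Residual-guided frame selection: reuse warped features when the mean
    residual magnitude is small, otherwise run the full segmentation network."""
    if prev_feat is None or residual.abs().mean() > thresh:
        return seg_net(frame)                        # full inference on selected frames
    return warp_with_motion_vectors(prev_feat, mv)   # cheap compressed-domain propagation
```

The residual-guided correction module described in the abstract would additionally refine only those regions where the residual is large; that step is omitted here for brevity.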
Related papers
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces.
Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements.
Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z)
- QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos [42.554100586090826]
Online free-viewpoint video (FVV) streaming is a challenging problem, which is relatively under-explored.
We propose QUEEN, a novel framework for QUantized and Efficient ENcoding of streaming FVV using 3D Gaussian Splatting.
We further propose a quantization-sparsity framework, which contains a learned latent decoder for effectively quantizing residuals other than the Gaussian positions.
arXiv Detail & Related papers (2024-12-05T18:59:55Z)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension, and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can currently process only second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis [40.249030338644225]
Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps.
Fast-Vid2Vid achieves near-real-time performance of around 20 FPS and saves about 8x computational cost on a single V100 GPU.
arXiv Detail & Related papers (2022-07-11T17:57:57Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines a very fast optical flow method on the CPU, used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next, with segmentation networks running on the GPU.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)