TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- URL: http://arxiv.org/abs/2003.13260v3
- Date: Tue, 18 Aug 2020 06:52:41 GMT
- Title: TapLab: A Fast Framework for Semantic Video Segmentation Tapping into
Compressed-Domain Knowledge
- Authors: Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, and
Haibin Ling
- Abstract summary: Real-time semantic video segmentation is a challenging task due to the strict requirements of inference speed.
Recent approaches mainly devote great efforts to reducing the model size for high efficiency.
We propose a simple and effective framework, dubbed TapLab, to tap into resources from the compressed domain.
- Score: 161.4188504786512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time semantic video segmentation is a challenging task due to the strict
requirements of inference speed. Recent approaches mainly devote great efforts
to reducing the model size for high efficiency. In this paper, we rethink this
problem from a different viewpoint: using knowledge contained in compressed
videos. We propose a simple and effective framework, dubbed TapLab, to tap into
resources from the compressed domain. Specifically, we design a fast feature
warping module using motion vectors for acceleration. To reduce the noise
introduced by motion vectors, we design a residual-guided correction module and
a residual-guided frame selection module using residuals. TapLab significantly
reduces redundant computations of the state-of-the-art fast semantic image
segmentation models, running 3 to 10 times faster with controllable accuracy
degradation. The experimental results show that TapLab achieves 70.6% mIoU on
the Cityscapes dataset at 99.8 FPS with a single GPU card for the 1024x2048
videos. A high-speed version even reaches the speed of 160+ FPS. Codes will be
available soon at https://github.com/Sixkplus/TapLab.
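A minimal sketch of the two compressed-domain ideas the abstract describes, written in Python/PyTorch. This is not the released TapLab implementation: the per-pixel motion-vector layout, the function names, and the residual threshold are assumptions made purely for illustration. The sketch warps the previous frame's segmentation features with motion vectors taken from the codec, and re-runs the full segmentation network whenever the residual energy is large (residual-guided frame selection).

```python
# Illustrative sketch only (not the authors' code). Assumes motion vectors have
# been upsampled from block level to a dense (B, 2, H, W) field in pixels, and
# that the codec residual is available as a tensor; the 0.05 threshold is arbitrary.
import torch
import torch.nn.functional as F


def warp_with_motion_vectors(prev_feat, mv):
    """Warp the previous frame's features to the current frame.

    prev_feat: (B, C, H, W) segmentation features of the previous frame.
    mv:        (B, 2, H, W) motion vectors (dx, dy) in pixels, pointing from
               each current-frame location to its reference location.
    """
    _, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_feat.device, dtype=prev_feat.dtype),
        torch.arange(w, device=prev_feat.device, dtype=prev_feat.dtype),
        indexing="ij",
    )
    # Sample each output location from its reference position in the previous frame.
    src_x = xs.unsqueeze(0) + mv[:, 0]
    src_y = ys.unsqueeze(0) + mv[:, 1]
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * src_x / (w - 1) - 1.0, 2.0 * src_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(prev_feat, grid, mode="nearest", align_corners=True)


def segment_frame(frame, prev_feat, mv, residual, seg_net, thresh=0.05):
    """Residual-guided frame selection: reuse warped features when the mean
    residual magnitude is small, otherwise run the full segmentation network."""
    if prev_feat is None or residual.abs().mean() > thresh:
        return seg_net(frame)                        # full inference on selected frames
    return warp_with_motion_vectors(prev_feat, mv)   # cheap compressed-domain propagation
```

The residual-guided correction module described in the abstract would additionally refine only those regions where the residual is large; that step is omitted here for brevity.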
Related papers
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces.
Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements.
Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z)
- QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos [42.554100586090826]
Online free-viewpoint video (FVV) streaming is a challenging problem, which is relatively under-explored.
We propose QUEEN, a novel framework for QUantized and Efficient ENcoding of streaming FVV using 3D Gaussian Splatting.
We further propose a quantization-sparsity framework, which contains a learned latent decoder for effectively quantizing residuals other than the Gaussian positions.
arXiv Detail & Related papers (2024-12-05T18:59:55Z)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension, and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can currently process only second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis [40.249030338644225]
Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps.
Fast-Vid2Vid achieves near-real-time performance of around 20 FPS and saves about 8x computational cost on a single V100 GPU.
arXiv Detail & Related papers (2022-07-11T17:57:57Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines a very fast optical flow method on the CPU, used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next, with segmentation networks running on the GPU.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)