Efficient Semantic Segmentation by Altering Resolutions for Compressed
Videos
- URL: http://arxiv.org/abs/2303.07224v1
- Date: Mon, 13 Mar 2023 15:58:15 GMT
- Title: Efficient Semantic Segmentation by Altering Resolutions for Compressed
Videos
- Authors: Yubin Hu, Yuze He, Yanghao Li, Jisheng Li, Yuxing Han, Jiangtao Wen,
Yong-Jin Liu
- Abstract summary: We propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient video segmentation.
AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes.
Experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video semantic segmentation (VSS) is a computationally expensive task due to
per-frame prediction on high-frame-rate videos. In recent work,
compact models or adaptive network strategies have been proposed for efficient
VSS. However, they did not consider a crucial factor that affects the
computational cost from the input side: the input resolution. In this paper, we
propose an altering resolution framework called AR-Seg for compressed videos to
achieve efficient VSS. AR-Seg aims to reduce the computational cost by using
low resolution for non-keyframes. To prevent the performance degradation caused
by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module,
and supervise it with a novel Feature Similarity Training (FST) strategy.
Specifically, CReFF first makes use of motion vectors stored in a compressed
video to warp features from high-resolution keyframes to low-resolution
non-keyframes for better spatial alignment, and then selectively aggregates the
warped features with a local attention mechanism. Furthermore, the proposed FST
supervises the aggregated features with high-resolution features through an
explicit similarity loss and an implicit constraint from the shared decoding
layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves
state-of-the-art performance and is compatible with different segmentation
backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs)
with the PSPNet18 backbone while maintaining high segmentation accuracy. Code:
https://github.com/THU-LYJ-Lab/AR-Seg.
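The abstract describes CReFF as warping high-resolution keyframe features to low-resolution non-keyframes using the motion vectors already stored in the compressed bitstream, then selectively aggregating them with local attention, while FST adds an explicit feature-similarity loss. A minimal NumPy sketch of that idea follows; the function names, tensor shapes, per-pixel sigmoid gate (standing in for the paper's learned local attention), and MSE loss are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def warp_features(keyframe_feat, motion_vectors):
    """Warp HR keyframe features to a non-keyframe using motion vectors
    from the compressed bitstream (nearest-neighbor gather).
    keyframe_feat: (H, W, C); motion_vectors: (H, W, 2) as (dy, dx)."""
    H, W, _ = keyframe_feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(ys + motion_vectors[..., 0], 0, H - 1)
    src_x = np.clip(xs + motion_vectors[..., 1], 0, W - 1)
    return keyframe_feat[src_y, src_x]

def creff_aggregate(lr_feat, warped_feat):
    """Selectively fuse warped keyframe features into the LR features.
    Here a per-pixel sigmoid gate on feature similarity stands in for
    the learned local attention used in the paper."""
    score = np.sum(lr_feat * warped_feat, axis=-1, keepdims=True)
    alpha = 1.0 / (1.0 + np.exp(-score))  # fusion weight in (0, 1)
    return alpha * warped_feat + (1.0 - alpha) * lr_feat

def fst_similarity_loss(aggregated_feat, hr_feat):
    """Explicit similarity term of FST, sketched as an MSE between the
    aggregated features and the HR reference features."""
    return float(np.mean((aggregated_feat - hr_feat) ** 2))
```

In the paper the fusion weights are learned and FST additionally imposes an implicit constraint by sharing the decoding layers between HR and LR branches; this sketch only illustrates the data flow of warp, aggregate, and supervise.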
Related papers
- High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729] (arXiv, 2024-10-03)
Enhanced Deep Hierarchical Video Compression (DHVC 2.0) delivers superior compression performance with impressive complexity efficiency.
It uses hierarchical predictive coding to transform each video frame into multiscale representations.
It supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
- Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval [16.497758750494537] (arXiv, 2023-09-15)
We propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism.
We leverage a Differentiable Context-aware Compression Module to encode salient and non-salient frame features.
We introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features at different resolutions.
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236] (arXiv, 2023-03-14)
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly uses compressed videos rather than fully decompressed frames as the visual input.
- A Codec Information Assisted Framework for Efficient Compressed Video Super-Resolution [15.690562510147766] (arXiv, 2022-10-15)
Video Super-Resolution (VSR) with recurrent neural network architectures is a promising solution due to its efficient modeling of long-range temporal dependencies.
We propose a Codec Information Assisted Framework (CIAF) to boost and accelerate recurrent VSR models for compressed videos.
- Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392] (arXiv, 2022-07-11)
We propose a learned video compression framework with a heterogeneous deformable compensation strategy (HDCVC) to tackle unstable compression performance.
More specifically, the proposed algorithm extracts features from two adjacent frames to estimate content-neighborhood heterogeneous deformable (HetDeform) kernel offsets.
Experimental results indicate that HDCVC outperforms recent state-of-the-art learned video compression approaches.
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185] (arXiv, 2022-04-08)
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from only a few adjacent frames.
We propose a novel Trajectory-Aware Transformer for Video Super-Resolution (TTVSR).
- Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning [30.54722074562783] (arXiv, 2020-12-01)
We decompose the input video into respective spatial texture frames (STFs) at its native spatial resolution.
Then, we compress them together using any popular video coder.
Finally, we synthesize decoded STFs and TMFs for high-quality video reconstruction at the same resolution as the native input.
- Deep Space-Time Video Upsampling Networks [47.62807427163614] (arXiv, 2020-04-06)
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems.
We propose an end-to-end framework for space-time video upsampling that efficiently merges VSR and FI into a joint framework.
It improves results both quantitatively and qualitatively, while reducing runtime (7x faster) and the number of parameters (by 30%) compared to baselines.
- Video Face Super-Resolution with Motion-Adaptive Feedback Cell [90.73821618795512] (arXiv, 2020-02-15)
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs).
In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block that efficiently captures motion compensation and feeds it back to the network in an adaptive way.