Spatial-Temporal Transformer based Video Compression Framework
- URL: http://arxiv.org/abs/2309.11913v1
- Date: Thu, 21 Sep 2023 09:23:13 GMT
- Title: Spatial-Temporal Transformer based Video Compression Framework
- Authors: Yanbo Gao, Wenjia Huang, Shuai Li, Hui Yuan, Mao Ye, Siwei Ma
- Abstract summary: We propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework.
It contains a Relaxed Deformable Transformer (RDT) with Uformer based offsets estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression.
Experimental results demonstrate that our method achieves the best result with a 13.5% BD-rate saving over VTM.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learned video compression (LVC) has witnessed remarkable advancements in
recent years. Similar to traditional video coding, LVC inherits motion
estimation/compensation, residual coding and other modules, all of which are
implemented with neural networks (NNs). However, within the framework of NNs
and its training mechanism using gradient backpropagation, most existing works
often struggle to consistently generate stable motion information, which is in
the form of geometric features, from the input color features. Moreover,
modules such as inter-prediction and residual coding are independent of
each other, making it inefficient to fully reduce the spatial-temporal
redundancy. To address the above problems, in this paper, we propose a novel
Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It
contains a Relaxed Deformable Transformer (RDT) with Uformer based offsets
estimation for motion estimation and compensation, a Multi-Granularity
Prediction (MGP) module based on multi-reference frames for prediction
refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T)
for efficient temporal-spatial joint residual compression. Specifically, RDT is
developed to stably estimate the motion information between frames by
thoroughly investigating the relationship between the similarity based
geometric motion feature extraction and self-attention. MGP is designed to fuse
the multi-reference frame information by effectively exploring the
coarse-grained prediction feature generated with the coded motion information.
SFD-T is designed to compress the residual information by jointly exploring the spatial
feature distributions in both residual and temporal prediction to further
reduce the spatial-temporal redundancy. Experimental results demonstrate that
our method achieves the best result, with a 13.5% BD-rate saving over VTM.
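The abstract above describes the coding pipeline only at an architectural level. As an illustration, the following is a minimal PyTorch-style sketch of how the three stages (RDT-style motion estimation/compensation, MGP-style prediction refinement, SFD-T-style residual coding) could be chained for one inter-coded frame. All class names, module internals, and shapes are assumptions made for this sketch, not the authors' implementation, and quantization/entropy coding are omitted.

```python
# Illustrative sketch only, not the authors' code: a skeleton of a conditional
# coding pipeline with the three stages named in the abstract. Each stage is a
# plain convolution standing in for the actual transformer modules.
import torch
import torch.nn as nn


class STTVCSketch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Shared feature extractor (the framework operates on learned features).
        self.feat = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Stand-in for RDT: motion estimation/compensation between frames.
        self.rdt = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Stand-in for MGP: prediction refinement with reference information.
        self.mgp = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Stand-in for SFD-T: residual coding conditioned on the temporal prediction.
        self.sfdt = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.recon = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, cur: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        f_cur, f_ref = self.feat(cur), self.feat(ref)
        # 1) Motion estimation/compensation: align the reference to the current frame.
        pred = self.rdt(torch.cat([f_cur, f_ref], dim=1))
        # 2) Prediction refinement using the (multi-)reference features.
        pred = self.mgp(torch.cat([pred, f_ref], dim=1))
        # 3) Residual coding: the residual is processed jointly with the prediction,
        #    which acts as a spatial-feature prior (quantization/entropy coding omitted).
        residual = f_cur - pred
        decoded_residual = self.sfdt(torch.cat([residual, pred], dim=1))
        return self.recon(pred + decoded_residual)


# Dummy usage: reconstruct a 64x64 frame from the current and one reference frame.
# model = STTVCSketch()
# rec = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```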
Related papers
- Collaborative Feedback Discriminative Propagation for Video Super-Resolution [66.61201445650323]
The key success of video super-resolution (VSR) methods stems mainly from exploring spatial and temporal information.
Inaccurate alignment usually leads to aligned features with significant artifacts.
Existing propagation modules only propagate the same-timestep features forward or backward.
arXiv Detail & Related papers (2024-04-06T22:08:20Z)
- Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video codec can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z)
- Multiscale Motion-Aware and Spatial-Temporal-Channel Contextual Coding Network for Learned Video Compression [24.228981098990726]
We propose a motion-aware and spatial-temporal-channel contextual coding based video compression network (MASTC-VC).
Our proposed MASTC-VC is superior to previous state-of-the-art (SOTA) methods on three public benchmark datasets.
Our method brings average BD-rate savings of 10.15% against H.265/HEVC (HM-16.20) in the PSNR metric and 23.93% against H.266/VVC (VTM-13.2) in the MS-SSIM metric.
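BD-rate, used in the figures above (and in the 13.5% result of the main abstract), reports the average bitrate change of a test codec relative to an anchor at equal quality. Below is a minimal sketch of the classical Bjøntegaard calculation (cubic fit of log-rate versus quality, integrated over the overlapping quality range); the function and the example numbers are illustrative assumptions, not taken from any of the papers listed here.

```python
# Illustrative sketch of the classical Bjontegaard delta-rate (BD-rate) metric.
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average % bitrate change of the test codec vs. the anchor at equal PSNR.

    Expects (typically four) rate/quality points per codec; negative = bitrate saving.
    """
    # Fit cubic polynomials of log-rate as a function of quality.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping quality range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate difference, converted to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0


# Example with made-up numbers (rates in kbps, PSNR in dB):
# bd_rate([1000, 1800, 3000, 4500], [34.0, 36.0, 38.0, 39.5],
#         [ 900, 1600, 2600, 4000], [34.2, 36.1, 38.2, 39.6])
```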
arXiv Detail & Related papers (2023-10-19T13:32:38Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- JNMR: Joint Non-linear Motion Regression for Video Frame Interpolation [47.123769305867775]
Video frame interpolation (VFI) aims to generate intermediate frames by warping learnable motions from bidirectional historical references.
We reformulate VFI as a Joint Non-linear Motion Regression (JNMR) strategy to model complicated inter-frame motions.
We show the effectiveness and significant improvement of joint motion regression compared with state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T02:47:29Z)
- Self-Supervised Learning of Perceptually Optimized Block Motion Estimates for Video Compression [50.48504867843605]
We propose a search-free block motion estimation framework using a multi-stage convolutional neural network.
We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames.
arXiv Detail & Related papers (2021-10-05T03:38:43Z)
- FVC: A New Framework towards Deep Video Compression in Feature Space [21.410266039564803]
We propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space.
The proposed framework achieves the state-of-the-art performance on four benchmark datasets including HEVC, UVG, VTL and MCL-JCV.
arXiv Detail & Related papers (2021-05-20T08:55:32Z)
- Spatiotemporal Entropy Model is All You Need for Learned Video Compression [9.227865598115024]
We propose a framework to compress raw-pixel frames (rather than residual images).
An entropy model is used to estimate the temporal redundancy in the latent space rather than at the pixel level.
Experiments show that the proposed method achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2021-04-13T10:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.