Object Segmentation-Assisted Inter Prediction for Versatile Video Coding
- URL: http://arxiv.org/abs/2403.11694v1
- Date: Mon, 18 Mar 2024 11:48:20 GMT
- Title: Object Segmentation-Assisted Inter Prediction for Versatile Video Coding
- Authors: Zhuoyuan Li, Zikun Yuan, Li Li, Dong Liu, Xiaohu Tang, Feng Wu
- Abstract summary: We propose an object segmentation-assisted inter prediction method (SAIP), where objects in the reference frames are segmented by advanced segmentation techniques.
With a proper indication, the object segmentation mask is translated from the reference frame to the current frame as the arbitrary-shaped partition of different regions.
We show that the proposed method achieves up to 1.98%, 1.14%, and 0.79%, and on average 0.82%, 0.49%, and 0.37%, BD-rate reduction for common test sequences under the Low-delay P, Low-delay B, and Random Access configurations, respectively.
- Score: 53.91821712591901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In modern video coding standards, block-based inter prediction is widely adopted because it brings high compression efficiency. However, natural videos usually contain multiple moving objects of arbitrary shapes, resulting in complex motion fields that are difficult to represent compactly. The Versatile Video Coding (VVC) standard tackles this problem with more flexible block partitioning methods, but the more flexible partitions require more overhead bits to signal and still cannot be made arbitrarily shaped. To address this limitation, we propose an object segmentation-assisted inter prediction method (SAIP), where objects in the reference frames are segmented by advanced segmentation techniques. With a proper indication, the object segmentation mask is translated from the reference frame to the current frame as the arbitrary-shaped partition of different regions, without any extra signaling. Using the segmentation mask, motion compensation is performed separately for different regions, achieving higher prediction accuracy. The segmentation mask is further used to code the motion vectors of different regions more efficiently. Moreover, the segmentation mask is considered in the joint rate-distortion optimization of motion estimation and partition estimation, so that the motion vectors and the partition of different regions are derived more accurately. The proposed method is implemented in the VVC reference software, VTM version 12.0. Experimental results show that it achieves up to 1.98%, 1.14%, and 0.79%, and on average 0.82%, 0.49%, and 0.37%, BD-rate reduction for common test sequences under the Low-delay P, Low-delay B, and Random Access configurations, respectively.
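The core mechanism is easy to see in code. Below is a minimal NumPy sketch of segmentation-assisted, region-wise motion compensation: a binary object mask acts as an arbitrary-shaped partition, and each region is predicted with its own motion vector. The function names, the integer-pel shift, and the two-region simplification are ours for illustration; the actual VTM integration (sub-pel interpolation, mask indication rules, RDO) is far more involved.

```python
import numpy as np

def translate(frame, mv):
    """Shift a frame by an integer-pel motion vector (dy, dx),
    replicating border samples; a stand-in for the sub-pel
    interpolation a real codec would use."""
    dy, dx = mv
    h, w = frame.shape
    ys = np.clip(np.arange(h) - dy, 0, h - 1)
    xs = np.clip(np.arange(w) - dx, 0, w - 1)
    return frame[np.ix_(ys, xs)]

def saip_prediction(ref, mask_ref, mv_obj, mv_bg):
    """Predict the current frame region-wise from a reference frame,
    using an object segmentation mask as the arbitrary-shaped
    partition (two regions here: object and background)."""
    # Translate the mask from the reference frame to the current
    # frame with the object's motion vector, so the partition itself
    # needs no extra signaling.
    mask_cur = translate(mask_ref.astype(np.uint8), mv_obj).astype(bool)
    # Motion-compensate each region with its own motion vector.
    pred_obj = translate(ref, mv_obj)
    pred_bg = translate(ref, mv_bg)
    # Compose the prediction region by region.
    return np.where(mask_cur, pred_obj, pred_bg)
```

In the paper the same mask also steers motion-vector coding and the joint rate-distortion search; this sketch covers only the prediction step.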
Related papers
- Uniformly Accelerated Motion Model for Inter Prediction [38.34487653360328]
In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly.
In Versatile Video Coding (VVC), existing inter prediction methods assume uniform speed motion between consecutive frames.
We introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames.
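The model itself is basic kinematics, d(t) = v t + a t^2 / 2. Below is a hedged Python sketch of one way to apply it, fitting velocity and acceleration from two observed reference-frame motion vectors and extrapolating to the current frame; the two-point fit is our illustration, not necessarily the paper's exact derivation.

```python
def uamm_extrapolate(mv1, mv2, t1, t2, t):
    """Extrapolate a motion vector under uniformly accelerated motion.

    mv1, mv2 : (dx, dy) motion vectors observed at temporal distances
               t1 and t2 (in frames, t1 != t2) from the current frame
    t        : temporal distance at which to predict
    Fits d(u) = v*u + 0.5*a*u**2 through both observations and
    evaluates it at u = t, per component.
    """
    out = []
    for d1, d2 in zip(mv1, mv2):
        # Solve d1 = v*t1 + a*t1^2/2 and d2 = v*t2 + a*t2^2/2 for (v, a).
        det = t1 * t2**2 / 2 - t2 * t1**2 / 2
        v = (d1 * t2**2 / 2 - d2 * t1**2 / 2) / det
        a = (d2 * t1 - d1 * t2) / det
        out.append(v * t + 0.5 * a * t**2)
    return tuple(out)
```

For example, `uamm_extrapolate((4, 0), (10, 2), 1, 2, 3)` returns `(18.0, 6.0)`, whereas a uniform-speed assumption would simply scale the nearest vector.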
arXiv Detail & Related papers (2024-07-16T09:46:29Z)
- Multiscale Motion-Aware and Spatial-Temporal-Channel Contextual Coding Network for Learned Video Compression [24.228981098990726]
We propose a motion-aware and spatial-temporal-channel contextual coding based video compression network (MASTC-VC).
Our proposed MASTC-VC is superior to previous state-of-the-art (SOTA) methods on three public benchmark datasets.
Our method brings average 10.15% BD-rate savings against H.265/HEVC (HM-16.20) in PSNR metric and average 23.93% BD-rate savings against H.266/VVC (VTM-13.2) in MS-SSIM metric.
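BD-rate figures like these come from the Bjontegaard metric: fit log-rate as a cubic polynomial of quality (PSNR or MS-SSIM) for the anchor and test codecs, integrate both fits over the overlapping quality range, and convert the mean log-rate gap back to a percentage. A compact sketch of the usual cubic-fit variant follows (function name is ours; production tools add checks for monotonic RD points):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate change (%) of the test
    codec vs. the anchor at equal quality, from four RD points each."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)

    # Fit log-rate as a cubic polynomial of quality for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate both fits over the overlapping quality interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Average log-rate difference, mapped back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

A negative result means the test codec needs less rate for the same quality, which is what the savings quoted above express.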
arXiv Detail & Related papers (2023-10-19T13:32:38Z)
- PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action Recognition [52.78234467516168]
We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames.
We present an adaptive frame selection strategy using a shifted leaky ReLU and the cumulative distribution function.
Our method achieves a relative improvement of 2.2-13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48.
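The abstract names the ingredients but not the formula; as one plausible reading (our assumption, not the paper's exact sampler), per-frame motion scores can be reshaped by a shifted leaky ReLU and turned into a sampling CDF:

```python
import numpy as np

def select_frames(scores, n_select, shift=0.1, slope=0.01):
    """Pick frames where a motion score is concentrated: pass the
    scores through a shifted leaky ReLU, normalize them into a CDF,
    and read off frame indices at uniform quantiles of that CDF.
    All constants are illustrative."""
    s = np.asarray(scores, dtype=float) - shift
    s = np.where(s > 0, s, slope * s)   # shifted leaky ReLU
    s = s - s.min() + 1e-8              # keep the weights positive
    cdf = np.cumsum(s) / s.sum()
    # One frame per uniform quantile; duplicates collapse via unique.
    q = (np.arange(n_select) + 0.5) / n_select
    return np.unique(np.searchsorted(cdf, q))
```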
arXiv Detail & Related papers (2023-04-14T00:01:11Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Self-Supervised Learning of Perceptually Optimized Block Motion Estimates for Video Compression [50.48504867843605]
We propose a search-free block motion estimation framework using a multi-stage convolutional neural network.
We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames.
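Here is a hedged PyTorch sketch of the loss side of this idea: warp the reference frame with an estimated motion field and score the prediction with MS-SSIM, so the motion network is trained for perceptual quality rather than MSE. The `pytorch_msssim` package and all shapes are our assumptions, not the paper's code, and MS-SSIM with default settings needs reasonably large frames (roughly 160 px and up per side).

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # third-party package, assumed installed

def warp(ref, flow):
    """Backward-warp a reference frame with a dense motion field.

    ref  : (N, C, H, W) reference frame in [0, 1]
    flow : (N, 2, H, W) per-pixel motion vectors in pixels (dx, dy)
    """
    n, _, h, w = ref.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=ref.dtype, device=ref.device),
        torch.arange(w, dtype=ref.dtype, device=ref.device),
        indexing="ij",
    )
    # Sampling positions, normalized to [-1, 1] for grid_sample.
    gx = 2 * (xs + flow[:, 0]) / (w - 1) - 1
    gy = 2 * (ys + flow[:, 1]) / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(ref, grid, align_corners=True)

def perceptual_motion_loss(ref, target, flow):
    """1 - MS-SSIM between the motion-compensated prediction and the
    target frame; minimizing it drives the motion estimator toward
    perceptually pleasing predictions."""
    pred = warp(ref, flow)
    return 1 - ms_ssim(pred, target, data_range=1.0)
```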
arXiv Detail & Related papers (2021-10-05T03:38:43Z)
- Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
Using STM with top-k filtering as the base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS, with speedups of up to 4.9x at little loss in accuracy.
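A minimal sketch of the propagation idea, assuming integer-pel, per-block motion vectors recovered from the bitstream (block size and names are ours; the actual method additionally uses the coded residuals to correct the propagated predictions):

```python
import numpy as np

def propagate_mask(mask_key, mv_field, block=16):
    """Warp a keyframe segmentation mask to a non-keyframe using the
    per-block motion vectors already present in the bitstream.

    mask_key : (H, W) mask predicted on the keyframe
    mv_field : (ceil(H/block), ceil(W/block), 2) integer (dy, dx) per block
    """
    h, w = mask_key.shape
    out = np.zeros_like(mask_key)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mv_field[by // block, bx // block]
            # Source rows/cols for this block, clipped at the borders.
            sy = np.clip(np.arange(by, min(by + block, h)) - dy, 0, h - 1)
            sx = np.clip(np.arange(bx, min(bx + block, w)) - dx, 0, w - 1)
            out[by:by + block, bx:bx + block] = mask_key[np.ix_(sy, sx)]
    return out
```

Because the motion vectors come for free with the compressed stream, this step costs far less than running the segmentation network on every frame, which is where the quoted speedup comes from.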
arXiv Detail & Related papers (2021-07-26T12:57:04Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
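The parameterization at the end is standard multi-view geometry: once a segment's 3D points and its rigid motion (R, t) are known, its image motion follows by projection. A small NumPy sketch (the camera intrinsics K and all names are assumed for illustration):

```python
import numpy as np

def rigid_flow(points, K, R, t):
    """2D optical flow induced by a 3D rigid motion (R, t) of one segment.

    points : (N, 3) 3D points of the segment in camera coordinates
    K      : (3, 3) camera intrinsic matrix
    R, t   : rotation (3, 3) and translation (3,) of the segment
    """
    # Project the points before the motion.
    p0 = (K @ points.T).T
    p0 = p0[:, :2] / p0[:, 2:3]
    # Apply the rigid transform, then project again.
    moved = points @ R.T + t
    p1 = (K @ moved.T).T
    p1 = p1[:, :2] / p1[:, 2:3]
    return p1 - p0
```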
arXiv Detail & Related papers (2021-01-11T04:20:30Z)