FrameRS: A Video Frame Compression Model Composed by Self supervised
Video Frame Reconstructor and Key Frame Selector
- URL: http://arxiv.org/abs/2309.09083v1
- Date: Sat, 16 Sep 2023 19:30:05 GMT
- Title: FrameRS: A Video Frame Compression Model Composed by Self supervised
Video Frame Reconstructor and Key Frame Selector
- Authors: Qiqian Fu, Guanhong Wang, Gaoang Wang
- Abstract summary: We present the frame reconstruction model FrameRS, which consists of a self-supervised video frame reconstructor and a key frame selector.
The frame reconstructor, FrameMAE, is developed by adapting the principles of the Masked Autoencoder (MAE) for images to the video context.
The key frame selector, Frame Selector, is built on a CNN architecture.
- Score: 9.896415488558036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a frame reconstruction model, FrameRS. It
consists of a self-supervised video frame reconstructor and a key frame
selector. The frame reconstructor, FrameMAE, is developed by adapting the
principles of the Masked Autoencoder (MAE) for images to the video context. The
key frame selector, Frame Selector, is built on a CNN architecture. Taking the
high-level semantic information from the encoder of FrameMAE as its input, it
can predict the key frames at low computational cost. Integrated with our
bespoke Frame Selector, FrameMAE can effectively compress a video clip by
retaining approximately 30% of its pivotal frames. Performance-wise, our model
showcases computational efficiency and competitive accuracy, marking a notable
improvement over traditional key frame extraction algorithms. The
implementation is available on GitHub.
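As a rough sketch of the selection step described in the abstract, the snippet below scores each frame with a small convolutional head over per-frame encoder features and keeps roughly the top 30%. All shapes, layer sizes, and names (`FrameScorer`, `select_key_frames`) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Scores each frame from its encoder features (hypothetical head)."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),  # temporal conv over frames
            nn.ReLU(),
            nn.Conv1d(256, 1, kernel_size=1),                    # one score per frame
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim) -> scores: (batch, num_frames)
        return self.head(feats.transpose(1, 2)).squeeze(1)

def select_key_frames(feats: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Return indices of the ~keep_ratio highest-scoring frames."""
    scores = FrameScorer(feats.shape[-1])(feats)
    k = max(1, int(round(feats.shape[1] * keep_ratio)))
    return scores.topk(k, dim=1).indices

feats = torch.randn(2, 16, 768)      # dummy FrameMAE-style encoder output
print(select_key_frames(feats))      # keeps 5 of 16 frames (~30%)
```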
Related papers
- Frame-Voyager: Learning to Query Frames for Video Large Language Models [33.84793162102087]
Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks.
Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for variations in information density across videos.
We propose Frame-Voyager, which learns to query informative frame combinations based on the textual queries given in the task.
arXiv Detail & Related papers (2024-10-04T08:26:06Z)
- Concatenated Masked Autoencoders as Spatial-Temporal Learner [6.475592804311682]
We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
arXiv Detail & Related papers (2023-11-02T03:08:26Z)
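A minimal sketch of the ViRe idea summarized above, assuming a standard (batch, frames, channels, height, width) tensor layout; the video autoencoder itself is omitted:

```python
import torch

clip = torch.randn(2, 16, 3, 32, 32)   # dummy clip: (batch, frames, C, H, W)
target = torch.flip(clip, dims=[1])    # temporally reversed frames as targets
# With some masked video autoencoder `model` (not shown), training would
# minimize a reconstruction loss such as mse_loss(model(clip), target),
# so the model must capture motion dynamics, not just appearance.
```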
- Predictive Coding For Animation-Based Video Compression [13.161311799049978]
We propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame.
Our experiments indicate a significant gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC.
arXiv Detail & Related papers (2023-07-09T14:40:54Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Butterfly: Multiple Reference Frames Feature Propagation Mechanism for Neural Video Compression [17.073251238499314]
We present a more reasonable multi-reference frames propagation mechanism for neural video compression.
Our method can significantly outperform the previous state-of-the-art (SOTA).
Our neural codec can achieve a -7.6% rate saving on HEVC Class D when compared with our base single-reference frame model.
arXiv Detail & Related papers (2023-03-06T08:19:15Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing video of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Advancing Learned Video Compression with In-loop Frame Prediction [177.67218448278143]
In this paper, we propose an Advanced Learned Video Compression (ALVC) approach with the in-loop frame prediction module.
The predicted frame can serve as a better reference than the previously compressed frame, and therefore it benefits the compression performance.
The experiments show the state-of-the-art performance of our ALVC approach in learned video compression.
arXiv Detail & Related papers (2022-11-13T19:53:14Z)
- Context-Aware Video Reconstruction for Rolling Shutter Cameras [52.28710992548282]
In this paper, we propose a context-aware GS video reconstruction architecture.
We first estimate the bilateral motion field so that the pixels of the two RS frames are warped to a common GS frame.
Then, a refinement scheme is proposed to guide the GS frame synthesis along with bilateral occlusion masks to produce high-fidelity GS video frames.
arXiv Detail & Related papers (2022-05-25T17:05:47Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame".
A valid question is how to define "useful information" and then distill a sequence down to one synthetic frame.
The proposed Informative Frame Synthesis (IFS) consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
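One toy way to condense a clip into a single frame, purely as an illustration of the idea in the entry above (the paper's IFS method is a learned synthesis network, not a simple weighted average):

```python
import torch

clip = torch.randn(16, 3, 32, 32)             # dummy clip: 16 frames
logits = torch.randn(16, requires_grad=True)  # per-frame weights (learned in practice)
weights = logits.softmax(dim=0).view(-1, 1, 1, 1)
frame = (weights * clip).sum(dim=0)           # one synthetic (3, 32, 32) "frame"
print(frame.shape)
```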
- End-to-End Learning for Video Frame Compression with Self-Attention [25.23586503813838]
We propose an end-to-end learned system for compressing video frames.
Our system learns deep embeddings of frames and encodes their difference in latent space.
In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality.
arXiv Detail & Related papers (2020-04-20T12:11:08Z)
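A minimal sketch of coding frame differences in latent space, in the spirit of the system above; the toy encoder/decoder and all shapes are assumptions, and quantization and entropy coding are omitted:

```python
import torch
import torch.nn as nn

# Toy per-frame encoder/decoder (the paper's networks are far richer).
encoder = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 32, 4, stride=2, padding=1))
decoder = nn.Sequential(nn.ConvTranspose2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))

frames = torch.randn(8, 3, 64, 64)         # dummy clip of 8 RGB frames
latents = encoder(frames)                  # deep embedding of each frame
residuals = latents[1:] - latents[:-1]     # differences in latent space
# A real codec would quantize and entropy-code `residuals`; here we just
# rebuild each latent from the first one plus cumulative residuals.
rebuilt = torch.cat([latents[:1], latents[:1] + residuals.cumsum(dim=0)], dim=0)
recon = decoder(rebuilt)                   # decoded frames
print((rebuilt - latents).abs().max())     # ~0: exact without quantization
```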