Structure-Preserving Patch Decoding for Efficient Neural Video Representation
- URL: http://arxiv.org/abs/2506.12896v2
- Date: Thu, 26 Jun 2025 12:27:09 GMT
- Title: Structure-Preserving Patch Decoding for Efficient Neural Video Representation
- Authors: Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe,
- Abstract summary: We propose a neural video representation method based on Structure-Preserving Patches (SPPs)<n>Our method separates each video frame into patch images of spatially aligned frames through a deterministic pixel-based splitting.<n>We train the decoder to reconstruct these structured patches, enabling a global-to-local decoding strategy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Implicit neural representations (INRs) are the subject of extensive research, particularly in their application to modeling complex signals by mapping spatial and temporal coordinates to corresponding values. When handling videos, mapping compact inputs to entire frames or spatially partitioned patch images is an effective approach. This strategy better preserves spatial relationships, reduces computational overhead, and improves reconstruction quality compared to coordinate-based mapping. However, predicting entire frames often limits the reconstruction of high-frequency visual details. Additionally, conventional patch-based approaches based on uniform spatial partitioning tend to introduce boundary discontinuities that degrade spatial coherence. We propose a neural video representation method based on Structure-Preserving Patches (SPPs) to address such limitations. Our method separates each video frame into patch images of spatially aligned frames through a deterministic pixel-based splitting similar to PixelUnshuffle. This operation preserves the global spatial structure while allowing patch-level decoding. We train the decoder to reconstruct these structured patches, enabling a global-to-local decoding strategy that captures the global layout first and refines local details. This effectively reduces boundary artifacts and mitigates distortions from naive upsampling. Experiments on standard video datasets demonstrate that our method achieves higher reconstruction quality and better compression performance than existing INR-based baselines.
Related papers
- CANeRV: Content Adaptive Neural Representation for Video Compression [89.35616046528624]
We propose Content Adaptive Neural Representation for Video Compression (CANeRV)<n>CANeRV is an innovative INR-based video compression network that adaptively conducts structure optimisation based on the specific content of each video sequence.<n>We show that CANeRV can outperform both H.266/VVC and state-of-the-art INR-based video compression techniques across diverse video datasets.
arXiv Detail & Related papers (2025-02-10T06:21:16Z) - CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation [11.364753833652182]
Implicit Neural Representation (INR) is a promising alternative to traditional transform-based methodologies.<n>We introduce CoordFlow, a novel pixel-wise INR for video compression.<n>It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques.
arXiv Detail & Related papers (2025-01-01T22:58:06Z) - High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression-DHVC 2.0- introduces superior compression performance and impressive complexity efficiency.
Uses hierarchical predictive coding to transform each video frame into multiscale representations.
Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z) - Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing.
This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z) - SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene
Reconstruction [66.69049158826677]
We introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction.
We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features.
We propose an effective and efficient methodology to fuse the outputs from these local models to attain the final reconstruction.
arXiv Detail & Related papers (2023-11-28T10:18:16Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos [5.958701846880935]
We propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos.
With model compression techniques, FFNeRV outperforms widely-used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
arXiv Detail & Related papers (2022-12-23T12:51:42Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Spatially-Adaptive Image Restoration using Distortion-Guided Networks [51.89245800461537]
We present a learning-based solution for restoring images suffering from spatially-varying degradations.
We propose SPAIR, a network design that harnesses distortion-localization information and dynamically adjusts to difficult regions in the image.
arXiv Detail & Related papers (2021-08-19T11:02:25Z) - ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z) - Learning Spatial and Spatio-Temporal Pixel Aggregations for Image and
Video Denoising [104.59305271099967]
We present a pixel aggregation network and learn the pixel sampling and averaging strategies for image denoising.
We develop a pixel aggregation network for video denoising to sample pixels across the spatial-temporal space.
Our method is able to solve the misalignment issues caused by large motion in dynamic scenes.
arXiv Detail & Related papers (2021-01-26T13:00:46Z) - Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A
Neural Exploration via Resolution-Adaptive Learning [30.54722074562783]
We decompose the input video into respective spatial texture frames (STF) at its native spatial resolution.
Then, we compress them together using any popular video coder.
Finally, we synthesize decoded STFs and TMFs for high-quality video reconstruction at the same resolution as its native input.
arXiv Detail & Related papers (2020-12-01T17:23:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.