SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion
- URL: http://arxiv.org/abs/2504.08012v2
- Date: Wed, 16 Apr 2025 01:18:13 GMT
- Title: SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion
- Authors: Yuseon Kim, Kyongseok Park
- Abstract summary: The strong recollection VP (SRVP) model integrates standard attention (SA) and reinforced feature attention (RFA) modules. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models.
- Score: 0.18416014644193066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.
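The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the recipe it describes: scaled dot-product attention applied along the temporal axis to extract temporal context and along the spatial axis to extract spatial correlations, followed by a fusion step. The tensor shapes and the fusion-by-addition choice are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5
    return F.softmax(scores, dim=-1) @ v

# Hypothetical shapes: B batches, T past frames, N spatial tokens, C channels.
B, T, N, C = 2, 4, 64, 32
frames = torch.randn(B, T, N, C)

# Temporal context: attend across the T past frames for each spatial token.
t_in = frames.permute(0, 2, 1, 3).reshape(B * N, T, C)
t_ctx = scaled_dot_product_attention(t_in, t_in, t_in)
t_ctx = t_ctx.reshape(B, N, T, C).permute(0, 2, 1, 3)

# Spatial correlations: attend across the N spatial tokens within each frame.
s_in = frames.reshape(B * T, N, C)
s_ctx = scaled_dot_product_attention(s_in, s_in, s_in).reshape(B, T, N, C)

# Fuse the two into an enhanced spatiotemporal representation (additive
# fusion is an assumption; the paper's SA/RFA fusion may differ).
fused = frames + t_ctx + s_ctx
print(fused.shape)  # torch.Size([2, 4, 64, 32])
```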
Related papers
- RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models.
By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information.
Our experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
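As a rough illustration of the cross-layer accumulation idea described above, one can average the features of neighboring layers to form the enriched representation. This is a sketch of the general idea only; the window size and uniform averaging are assumptions, not RepVideo's actual aggregation rule.

```python
import torch

def accumulate_neighbor_layers(layer_feats, window=3):
    # layer_feats: list of per-layer feature maps, each of shape (B, N, C).
    # Each layer's representation is replaced by the mean over a window of
    # neighboring layers, yielding more stable semantic features.
    num_layers = len(layer_feats)
    enriched = []
    for i in range(num_layers):
        lo = max(0, i - window // 2)
        hi = min(num_layers, i + window // 2 + 1)
        enriched.append(torch.stack(layer_feats[lo:hi]).mean(dim=0))
    return enriched

feats = [torch.randn(1, 16, 8) for _ in range(6)]  # toy per-layer features
out = accumulate_neighbor_layers(feats)
print(len(out), out[0].shape)  # 6 torch.Size([1, 16, 8])
```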
arXiv Detail & Related papers (2025-01-15T18:20:37Z) - Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors [80.92195378575671]
We describe a strong baseline for arbitrary-scale video super-resolution (AVSR).
We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network.
Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-the-art.
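A hedged sketch of how a multi-scale prior can be pulled from a pre-trained VGG, as the summary describes. The choice of VGG-16 and of the tapped layers (relu1_2, relu2_2, relu3_3) is an assumption here, not necessarily ST-AVSR's configuration.

```python
import torch
import torchvision

# Pre-trained VGG-16 feature extractor (weights download on first use).
vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.DEFAULT).features.eval()
TAP_LAYERS = {3, 8, 15}  # relu1_2, relu2_2, relu3_3 (assumed tap points)

@torch.no_grad()
def multiscale_prior(x):
    # Collect activations at several depths: shallow layers carry texture,
    # deeper layers carry structure, at decreasing spatial resolution.
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in TAP_LAYERS:
            feats.append(x)
        if i == max(TAP_LAYERS):
            break
    return feats

priors = multiscale_prior(torch.randn(1, 3, 128, 128))
print([tuple(f.shape) for f in priors])
# [(1, 64, 128, 128), (1, 128, 64, 64), (1, 256, 32, 32)]
```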
arXiv Detail & Related papers (2024-07-13T15:27:39Z) - Enhancing Adaptive History Reserving by Spiking Convolutional Block Attention Module in Recurrent Neural Networks [21.509659756334802]
Spiking neural networks (SNNs) serve as one type of efficient model to process spatio-temporal patterns in time series.
In this paper, we develop a recurrent spiking neural network (RSNN) model embedded with an advanced spiking convolutional attention module (SCBAM) component.
It adaptively invokes history information in the spatial and temporal channels through SCBAM, which brings the advantages of efficiently recalling history information and eliminating redundancy.
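SCBAM builds on the CBAM idea of sequential channel and spatial attention. The sketch below is a plain (non-spiking) CBAM-style block; the spiking neuron dynamics and the paper's exact adaptive history mechanism are omitted, so treat it as an analogy rather than the authors' module.

```python
import torch
import torch.nn as nn

class CBAMLikeAttention(nn.Module):
    # Channel attention from pooled statistics, then spatial attention from
    # channel-pooled maps, in the style of CBAM.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg, mx = x.mean(dim=(2, 3)), x.amax(dim=(2, 3))
        ch = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ch.view(b, c, 1, 1)              # reweight channels
        sp = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial_conv(sp))  # reweight locations

attn = CBAMLikeAttention(16)
print(attn(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```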
arXiv Detail & Related papers (2024-01-08T08:05:34Z) - Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate [16.4160685571157]
Recurrent Neural Networks (RNNs) are widely recognized for their proficiency in modeling temporal dependencies.
This paper proposes a novel Delayed Memory Unit (DMU) for gated RNNs.
The DMU incorporates a delay line structure along with delay gates into vanilla RNN, thereby enhancing temporal interaction and facilitating temporal credit assignment.
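The summary suggests the DMU keeps a delay line of recent hidden states and uses delay gates to mix them. Below is one possible reading in PyTorch; the cell structure, gate placement, and softmax mixing are interpretive assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class DelayedMemoryRNNCell(nn.Module):
    # A vanilla RNN cell whose output mixes the last D hidden states through
    # a learned delay gate, so credit can flow to earlier time steps.
    def __init__(self, input_size, hidden_size, max_delay=4):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.delay_gate = nn.Linear(input_size + hidden_size, max_delay)
        self.max_delay = max_delay

    def forward(self, x, delay_line):
        # delay_line: list of the D most recent hidden states, newest last.
        h = self.cell(x, delay_line[-1])
        delay_line = (delay_line + [h])[-self.max_delay:]
        gate = torch.softmax(self.delay_gate(torch.cat([x, h], -1)), dim=-1)
        stacked = torch.stack(delay_line, dim=1)         # (B, D, H)
        out = (gate.unsqueeze(-1) * stacked).sum(dim=1)  # gated delayed mix
        return out, delay_line

cell = DelayedMemoryRNNCell(8, 16)
line = [torch.zeros(2, 16)] * 4
for _ in range(5):
    out, line = cell(torch.randn(2, 8), line)
print(out.shape)  # torch.Size([2, 16])
```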
arXiv Detail & Related papers (2023-10-23T14:29:48Z) - Attention-based Spatial-Temporal Graph Convolutional Recurrent Networks for Traffic Forecasting [12.568905377581647]
Traffic forecasting is one of the most fundamental problems in transportation science and artificial intelligence.
Existing methods cannot accurately model both long-term and short-term temporal correlations simultaneously.
We propose a novel spatial-temporal neural network framework, which consists of a graph convolutional recurrent module (GCRN) and a global attention module.
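A minimal sketch of the two-part design the summary describes: a graph convolution feeding a recurrent unit for short-term correlations, with attention pooling over time for long-term ones. The A_hat X W graph-convolution form and all module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimpleGCRN(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, hid_dim)     # graph-conv weight
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)       # global temporal attention

    def forward(self, x, a_hat):                # x: (B, T, N, F), a_hat: (N, N)
        b, t, n, _ = x.shape
        gc = torch.einsum('ij,btjf->btif', a_hat, self.w(x))  # A_hat X W
        seq = gc.permute(0, 2, 1, 3).reshape(b * n, t, -1)
        h, _ = self.gru(seq)                    # short-term recurrence
        w = torch.softmax(self.attn(h), dim=1)  # long-term attention over time
        return (w * h).sum(dim=1).reshape(b, n, -1)

model = SimpleGCRN(in_dim=2, hid_dim=8)
out = model(torch.randn(4, 12, 5, 2), torch.eye(5))  # 5-node toy graph
print(out.shape)  # torch.Size([4, 5, 8])
```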
arXiv Detail & Related papers (2023-02-25T03:37:00Z) - STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during the feature extraction and the state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z) - STDAN: Deformable Attention Network for Space-Time Video Super-Resolution [39.18399652834573]
We propose a deformable attention network called STDAN for STVSR.
First, we devise a long-short term feature interpolation (LSTFI) module, which is capable of extracting abundant content from more neighboring input frames.
Second, we put forward a spatial-temporal deformable feature aggregation (STDFA) module, in which spatial and temporal contexts are adaptively captured and aggregated.
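A common way to realize "deformable" aggregation is to predict per-pixel sampling offsets and bilinearly resample a neighboring frame at the deformed locations. The toy sketch below does exactly that with grid_sample; it is an illustration of the mechanism, not STDAN's actual STDFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDeformableAggregation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict (dx, dy) offsets from the reference/neighbor feature pair.
        self.offset_pred = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, ref, nbr):                 # both (B, C, H, W)
        b, _, h, w = ref.shape
        offsets = self.offset_pred(torch.cat([ref, nbr], dim=1))  # (B, 2, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        base = torch.stack([xs, ys], dim=-1).to(ref).expand(b, h, w, 2)
        grid = base + offsets.permute(0, 2, 3, 1)  # deformed sampling grid
        return F.grid_sample(nbr, grid, align_corners=True)

agg = TinyDeformableAggregation(8)
out = agg(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```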
arXiv Detail & Related papers (2022-03-14T03:40:35Z) - Recurrence-in-Recurrence Networks for Video Deblurring [58.49075799159015]
State-of-the-art video deblurring methods often adopt recurrent neural networks to model the temporal dependency between the frames.
In this paper, we propose a recurrence-in-recurrence network architecture to cope with the limitations of short-ranged memory.
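One way to picture a recurrence-in-recurrence design: an outer recurrence steps over frames while an inner recurrence re-scans a short buffer of recent hidden states, stretching the effective memory range. The sketch below is an interpretation of the name only, with an assumed buffer length of three, and is not the paper's architecture.

```python
import torch
import torch.nn as nn

class RecurrenceInRecurrence(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.outer = nn.GRUCell(dim, dim)                # steps over frames
        self.inner = nn.GRU(dim, dim, batch_first=True)  # re-scans history

    def forward(self, frames):                   # frames: (B, T, C)
        b, t, c = frames.shape
        h, buf = frames.new_zeros(b, c), []
        for i in range(t):
            h = self.outer(frames[:, i], h)
            buf = (buf + [h])[-3:]               # short history buffer
            _, h_inner = self.inner(torch.stack(buf, dim=1))
            h = h_inner.squeeze(0)               # inner pass refreshes state
        return h

print(RecurrenceInRecurrence(8)(torch.randn(2, 5, 8)).shape)  # (2, 8)
```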
arXiv Detail & Related papers (2022-03-12T11:58:13Z) - Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition [0.5448283690603357]
Previous results have shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape for various computer vision tasks.
This raises the suspicion that large video models learn spurious correlations rather than track relevant shapes over time.
We study the cross-domain robustness of recurrent, attention-based, and convolutional video models to investigate whether this robustness is influenced by the frame dependency modeling.
arXiv Detail & Related papers (2021-12-22T19:11:53Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through tightly coupled multimodal spatiotemporal representations.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - Group-based Bi-Directional Recurrent Wavelet Neural Networks for Video Super-Resolution [4.9136996406481135]
Video super-resolution (VSR) aims to estimate a high-resolution (HR) frame from low-resolution (LR) frames.
The key challenge for VSR lies in the effective exploitation of intra-frame spatial correlation and temporal dependency between consecutive frames.
arXiv Detail & Related papers (2021-06-14T06:36:13Z) - PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.