STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing
- URL: http://arxiv.org/abs/2411.10198v1
- Date: Fri, 15 Nov 2024 13:53:19 GMT
- Title: STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing
- Authors: Andrea Alfarano, Alberto Alfarano, Linda Friso, Andrea Bacciu, Irene Amerini, Fabrizio Silvestri
- Abstract summary: We propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers.
STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together.
Our architecture achieves state-of-the-art performance on STL benchmarks across datasets and settings, while significantly improving computational efficiency in terms of parameters and FLOPs.
- Score: 6.872340834265972
- Abstract: Spatio-temporal predictive learning is a self-supervised learning paradigm that enables models to identify spatial and temporal patterns by predicting future frames from past frames. Traditional methods, which use recurrent neural networks to capture temporal patterns, have proven effective but come with high system complexity and computational demand. Convolutions could offer a more efficient alternative, but they are limited by treating all previous frames equally, which results in poor temporal characterization, and by their local receptive field, which restricts their capacity to capture distant correlations among frames. In this paper, we propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers. STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together, using a single convolution to mix both types of features into a comprehensive spatio-temporal patch representation. This representation is then processed in a purely convolutional framework that can focus simultaneously on interactions among near and distant patches, subsequently allowing for efficient reconstruction of the predicted frames. Our architecture achieves state-of-the-art performance on STL benchmarks across different datasets and settings, while significantly improving computational efficiency in terms of parameters and FLOPs. The code is publicly available.
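The abstract specifies the mechanism but not the layer shapes or hyperparameters, so the following is only a minimal PyTorch sketch of the idea it describes: fold the temporal dimension into the channels, embed joint spatio-temporal patches with a single convolution, then mix them with depth-wise (large-kernel) and channel-wise (1x1) convolutions. All module names, kernel sizes, and dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class STPatchEmbed(nn.Module):
    """Fold time into channels, then embed spatio-temporal patches with ONE conv."""
    def __init__(self, t, c, dim, patch=4):
        super().__init__()
        # A single strided conv mixes temporal (t*c input channels) and
        # spatial (patch x patch window) information in one step.
        self.proj = nn.Conv2d(t * c, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        return self.proj(x.reshape(b, t * c, h, w))

class DWPWBlock(nn.Module):
    """Purely convolutional mixer: a large-kernel depth-wise conv reaches
    distant patches; a channel-wise 1x1 conv mixes features."""
    def __init__(self, dim, kernel=7):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):                      # residual keeps training stable
        return x + self.pw(self.act(self.norm(self.dw(x))))

class STLightSketch(nn.Module):
    def __init__(self, t=10, c=1, dim=64, patch=4, depth=4):
        super().__init__()
        self.embed = STPatchEmbed(t, c, dim, patch)
        self.mixer = nn.Sequential(*[DWPWBlock(dim) for _ in range(depth)])
        # Reconstruct the T predicted frames from the mixed patch tokens.
        self.head = nn.ConvTranspose2d(dim, t * c, kernel_size=patch, stride=patch)
        self.t, self.c = t, c

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, _, _, h, w = x.shape
        y = self.head(self.mixer(self.embed(x)))
        return y.reshape(b, self.t, self.c, h, w)

frames = torch.randn(2, 10, 1, 64, 64)        # Moving-MNIST-like input (assumed)
print(STLightSketch()(frames).shape)          # torch.Size([2, 10, 1, 64, 64])
```

Because every learnable layer is a depth-wise or channel-wise (1x1) convolution, the block avoids the cost of full dense convolutions (dim^2 * k^2 weights), which is consistent with the efficiency claim above.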
Related papers
- Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Adaptive Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z)
- A Unified Framework for Neural Computation and Learning Over Time [56.44910327178975]
Hamiltonian Learning is a novel unified framework for learning with neural networks "over time".
It is based on differential equations that: (i) can be integrated without the need for external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; and (iii) open up novel perspectives (a toy integration sketch follows this entry).
arXiv Detail & Related papers (2024-09-18T14:57:13Z)
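Points (i)-(ii) of the summary are easiest to see on a toy case. Below is a minimal sketch (my assumption, not the paper's Hamiltonian equations): plain gradient-based learning rewritten as the ODE dw/dt = -grad L(w) and integrated with explicit Euler steps, i.e., without any external ODE solver; this simplest instance reduces exactly to gradient descent.

```python
import torch

# Toy: learning as solver-free integration of  dw/dt = -grad L(w).
# Each explicit Euler step  w <- w + dt * (-grad)  is one update; for this
# ODE it coincides with gradient descent at learning rate dt.
w = torch.zeros(2, requires_grad=True)
target = torch.tensor([1.0, -2.0])     # hypothetical regression target
dt = 0.1                               # Euler step size (assumed)

for _ in range(200):
    loss = ((w - target) ** 2).sum()
    (grad,) = torch.autograd.grad(loss, w)
    with torch.no_grad():
        w += dt * (-grad)              # one Euler integration step, no solver

print(w.detach())                      # converges to the target
```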
- LaT-PFN: A Joint Embedding Predictive Architecture for In-context Time-series Forecasting [0.0]
We introduce LatentTimePFN (LaT-PFN), a foundation time-series model with a strong embedding space that enables zero-shot forecasting.
We perform in-context learning in latent space utilizing a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks.
arXiv Detail & Related papers (2024-05-16T13:44:56Z)
- StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences [31.210626775505407]
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation.
We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks.
StreamFlow excels in performance on the challenging KITTI and Sintel datasets, with particular improvements in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z)
- TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in Structured Light System [17.53719804060679]
TIDE-Net is a learning-based technique for disparity computation in mono-camera structured light systems.
We exploit the deformation of projected patterns (named pattern flow) on captured image sequences to model the temporal information.
For each incoming frame, our model fuses correlation volumes (from the current frame) with the disparity (from the former frame) warped by the pattern flow; a minimal warping sketch follows this entry.
arXiv Detail & Related papers (2023-10-13T07:55:33Z)
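The fusion input described above, "disparity from the former frame warped by pattern flow", can be illustrated with a standard backward-warping operation. This is only a minimal stand-in sketch: the flow field is assumed given, and TIDE-Net's correlation-volume fusion is not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(disp_prev, flow):
    """Backward-warp the former frame's disparity with a flow field.
    disp_prev: (B, 1, H, W) disparity; flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = disp_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().unsqueeze(0).to(disp_prev)  # (1,2,H,W)
    src = base + flow                               # sampling locations
    # grid_sample expects coordinates normalized to [-1, 1], (x, y) last.
    src_x = 2.0 * src[:, 0] / (w - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (h - 1) - 1.0
    grid = torch.stack((src_x, src_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(disp_prev, grid, align_corners=True)

disp = torch.rand(1, 1, 32, 32)
flow = torch.zeros(1, 2, 32, 32)    # zero flow: warping is the identity
print(torch.allclose(warp_by_flow(disp, flow), disp, atol=1e-5))  # True
```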
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning [67.07363529640784]
We propose OpenSTL to categorize prevalent approaches into recurrent-based and recurrent-free models.
We conduct standard evaluations on datasets across various domains, including synthetic moving-object trajectories, human motion, driving scenes, traffic flow, and weather forecasting.
We find that recurrent-free models achieve a better balance between efficiency and performance than recurrent models.
arXiv Detail & Related papers (2023-06-20T03:02:14Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions (a hypothetical sketch follows this entry).
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
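One plausible reading of such a multi-temporal block (an assumption on my part; the paper's exact design may differ) is a set of parallel temporal convolutions with different kernel sizes, fused by a 1x1x1 convolution. Operating on (B, C, T, H, W) tensors makes it drop-in compatible with any 3D-CNN, and depth-wise temporal convolutions keep it lightweight:

```python
import torch
import torch.nn as nn

class MultiTemporalConv(nn.Module):
    """Parallel depth-wise temporal convs at several kernel sizes (i.e.
    several temporal resolutions), fused by a 1x1x1 conv. Kernel sizes and
    the fusion layer are illustrative assumptions."""
    def __init__(self, channels, t_kernels=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                      padding=(k // 2, 0, 0), groups=channels)  # depth-wise
            for k in t_kernels
        )
        self.fuse = nn.Conv3d(channels * len(t_kernels), channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(2, 16, 8, 28, 28)
print(MultiTemporalConv(16)(x).shape)          # torch.Size([2, 16, 8, 28, 28])
```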
This list is automatically generated from the titles and abstracts of the papers on this site.