Long-term Leap Attention, Short-term Periodic Shift for Video
Classification
- URL: http://arxiv.org/abs/2207.05526v1
- Date: Tue, 12 Jul 2022 13:30:15 GMT
- Title: Long-term Leap Attention, Short-term Periodic Shift for Video
Classification
- Authors: Hao Zhang, Lechao Cheng, Yanbin Hao, Chong-Wah Ngo
- Abstract summary: A video transformer naturally incurs a heavier computation burden than a static vision transformer.
We propose the LAPS, a long-term ``Leap Attention'' (LA), short-term ``Periodic Shift'' (P-Shift) module for video transformers.
- Score: 41.87505528859225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A video transformer naturally incurs a heavier computation burden than a static
vision transformer, as the former processes a sequence $T$ times longer than the
latter under standard attention of quadratic complexity $(T^2N^2)$. The
existing works treat the temporal axis as a simple extension of spatial axes,
focusing on shortening the spatio-temporal sequence by either generic pooling
or local windowing without utilizing temporal redundancy.
However, videos naturally contain redundant information between neighboring
frames; thus, we could potentially suppress attention on visually similar
frames in a dilated manner. Based on this hypothesis, we propose the LAPS, a
long-term ``\textbf{\textit{Leap Attention}}'' (LA), short-term
``\textbf{\textit{Periodic Shift}}'' (\textit{P}-Shift) module for video
transformers, with $(2TN^2)$ complexity. Specifically, the ``LA'' groups
long-term frames into pairs, then refactors each discrete pair via attention.
The ``\textit{P}-Shift'' exchanges features between temporal neighbors to
confront the loss of short-term dynamics. By replacing a vanilla 2D attention
with the LAPS, we could adapt a static transformer into a video one, with zero
extra parameters and negligible computation overhead ($\sim$2.6\%).
Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS
transformer achieves competitive performance in terms of accuracy, FLOPs,
and parameters among CNN and transformer SOTAs. We open-source our project at
https://github.com/VideoNetworks/LAPS-transformer .
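To make the mechanism concrete, the following is a minimal PyTorch sketch of the two components as described above. The leap pairing of frame $t$ with frame $t+T/2$, the 1/8 channel-shift ratio, and the omission of Q/K/V projections and multi-head splitting are simplifying assumptions for illustration; the authors' actual implementation is in the repository linked above.

```python
# Minimal sketch of the LAPS idea (long-term Leap Attention + short-term
# Periodic Shift) on video tokens of shape (B, T, N, C), with T assumed even.
# The T/2 leap offset, the shift ratio, and the lack of Q/K/V projections are
# assumptions for illustration, not the authors' released design.
import torch


def leap_attention(x):
    """Pair frame t with frame t + T/2 and attend jointly over each 2N-token
    pair: T/2 pairs x (2N)^2 interactions ~ 2*T*N^2, instead of (T*N)^2 for
    full spatio-temporal attention."""
    B, T, N, C = x.shape
    pairs = torch.cat([x[:, : T // 2], x[:, T // 2:]], dim=2)  # (B, T/2, 2N, C)
    attn = torch.softmax(pairs @ pairs.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ pairs                                          # (B, T/2, 2N, C)
    return torch.cat([out[:, :, :N], out[:, :, N:]], dim=1)     # (B, T, N, C)


def periodic_shift(x, ratio=0.125):
    """Exchange a small slice of channels with the neighbouring frames
    (TSM-style temporal shift) to recover short-term dynamics."""
    c = int(x.shape[-1] * ratio)
    fwd = torch.roll(x[..., :c], shifts=1, dims=1)        # from previous frame
    bwd = torch.roll(x[..., c:2 * c], shifts=-1, dims=1)  # from next frame
    return torch.cat([fwd, bwd, x[..., 2 * c:]], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 8, 49, 64)                    # 8 frames of 7x7 patches
    print(leap_attention(periodic_shift(x)).shape)   # torch.Size([2, 8, 49, 64])
```

Each of the $T/2$ pairs attends over $2N$ tokens, so the attention cost is $(T/2)\cdot(2N)^2 = 2TN^2$, matching the complexity stated in the abstract.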
Related papers
- Fi$^2$VTS: Time Series Forecasting Via Capturing Intra- and Inter-Variable Variations in the Frequency Domain [6.61394789494625]
Time series forecasting (TSF) plays a crucial role in various applications, including medical monitoring and crop growth.
We introduce the Fi$^2$VBlock, which leverages a frequency-domain perspective to capture intra- and inter-variable variations.
Inception blocks are employed to integrate information, thus capturing correlations across different variables.
Our backbone network, Fi$^2$VTS, employs a residual architecture by concatenating multiple Fi$^2$VBlocks.
arXiv Detail & Related papers (2024-07-31T01:50:39Z) - Hierarchical Separable Video Transformer for Snapshot Compressive Imaging [46.23615648331571]
Hierarchical Separable Video Transformer (HiSViT) is a reconstruction architecture without temporal aggregation.
HiSViT is built by multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN).
Our method outperforms previous methods by $>0.5$ with comparable or fewer parameters and complexity.
arXiv Detail & Related papers (2024-07-16T17:35:59Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both long-term dynamics and short-term changes of video.
To address this issue, we reformulate the cross-attention in a video transformer through the lens of the kernel.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
arXiv Detail & Related papers (2022-09-19T17:59:02Z) - RSTT: Real-time Spatial Temporal Transformer for Space-Time Video
Super-Resolution [13.089535703790425]
Space-time video super-resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce High-Frame-Rate (HFR) and also High-Resolution (HR) counterparts.
We propose using a spatial-temporal transformer that naturally incorporates the spatial and temporal super resolution modules into a single model.
arXiv Detail & Related papers (2022-03-27T02:16:26Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a minimal sketch of the underlying Toeplitz-FFT product appears after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
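The last entry above hinges on a standard fact: a Toeplitz matrix-vector product can be computed in $O(n\log n)$ by embedding the Toeplitz matrix in a circulant one and applying the FFT. Below is a minimal NumPy sketch of that fact alone; it does not reproduce that paper's kernelized-attention pipeline.

```python
# Toeplitz matrix-vector product via circulant embedding + FFT, O(n log n).
import numpy as np


def toeplitz_matvec_fft(first_col, first_row, v):
    """Compute T @ v where T is Toeplitz with the given first column/row
    (first_col[0] must equal first_row[0])."""
    n = len(v)
    # First column of the 2n x 2n circulant embedding:
    # [t_0, ..., t_{n-1}, 0, t_{-(n-1)}, ..., t_{-1}]
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    v_pad = np.concatenate([v, np.zeros(n)])
    # Circulant matvec = circular convolution = pointwise product in Fourier space.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(v_pad)).real[:n]


if __name__ == "__main__":
    n = 6
    col = np.random.randn(n)
    row = np.concatenate([[col[0]], np.random.randn(n - 1)])
    # Build the dense Toeplitz matrix explicitly for comparison.
    T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
                  for i in range(n)])
    v = np.random.randn(n)
    print(np.allclose(T @ v, toeplitz_matvec_fft(col, row, v)))  # True
```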