Stand-Alone Inter-Frame Attention in Video Models
- URL: http://arxiv.org/abs/2206.06931v1
- Date: Tue, 14 Jun 2022 15:51:28 GMT
- Title: Stand-Alone Inter-Frame Attention in Video Models
- Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, and Tao Mei
- Abstract summary: We present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer.
- Score: 164.06137994796487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion, as the uniqueness of a video, has been critical to the development of
video understanding models. Modern deep learning models leverage motion by
either executing spatio-temporal 3D convolutions, factorizing 3D convolutions
into spatial and temporal convolutions separately, or computing self-attention
along temporal dimension. The implicit assumption behind such successes is that
the feature maps across consecutive frames can be nicely aggregated.
Nevertheless, the assumption may not always hold especially for the regions
with large deformation. In this paper, we present a new recipe of inter-frame
attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly
delves into the deformation across frames to estimate local self-attention on
each spatial location. Technically, SIFA remoulds the deformable design via
re-scaling the offset predictions by the difference between two frames. Taking
each spatial location in the current frame as the query, the locally deformable
neighbors in the next frame are regarded as the keys/values. Then, SIFA
measures the similarity between query and keys as stand-alone attention to
weighted average the values for temporal aggregation. We further plug SIFA
block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net
and SIFA-Transformer. Extensive experiments conducted on four video datasets
demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger
backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on
Kinetics-400 dataset. Source code is available at
https://github.com/FuchenUSTC/SIFA.
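As a reading aid, here is a minimal PyTorch sketch of the mechanism the abstract describes, assuming k deformable neighbors sampled with grid_sample. All layer names, the neighbor count, the tanh re-scaling branch, and the residual connection are assumptions for illustration, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SIFABlockSketch(nn.Module):
    """Inter-frame attention over locally deformable neighbors (illustrative only)."""

    def __init__(self, channels, num_neighbors=9):
        super().__init__()
        self.k = num_neighbors
        # Raw (x, y) offsets for k deformable neighbors, predicted from both frames,
        # then re-scaled by a factor derived from the inter-frame difference.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * num_neighbors, 3, padding=1)
        self.offset_scale = nn.Conv2d(channels, 2 * num_neighbors, 3, padding=1)
        self.query_proj = nn.Conv2d(channels, channels, 1)
        self.key_proj = nn.Conv2d(channels, channels, 1)
        self.value_proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_t, feat_t1):
        """feat_t, feat_t1: (B, C, H, W) features of the current and the next frame."""
        b, c, h, w = feat_t.shape
        raw = self.offset_pred(torch.cat((feat_t, feat_t1), dim=1))       # (B, 2k, H, W)
        scale = torch.tanh(self.offset_scale(feat_t1 - feat_t))           # re-scaling term
        offsets = (raw * scale).view(b, self.k, 2, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates (x, y order for grid_sample).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat_t.device),
            torch.linspace(-1, 1, w, device=feat_t.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                              # (H, W, 2)

        q = self.query_proj(feat_t)                                       # queries: current frame
        k_feat = self.key_proj(feat_t1)                                   # keys/values: next frame
        v_feat = self.value_proj(feat_t1)

        keys, values = [], []
        for i in range(self.k):
            off = offsets[:, i].permute(0, 2, 3, 1)                       # (B, H, W, 2) pixel offsets
            off = off / torch.tensor([w / 2.0, h / 2.0], device=feat_t.device)
            grid = base.unsqueeze(0) + off                                # deformable sampling grid
            keys.append(F.grid_sample(k_feat, grid, align_corners=True))
            values.append(F.grid_sample(v_feat, grid, align_corners=True))
        keys = torch.stack(keys, dim=2)                                   # (B, C, k, H, W)
        values = torch.stack(values, dim=2)

        # Stand-alone attention: per-location similarity between the query and its k
        # deformable neighbors, softmax-normalized, then a weighted average of the values.
        attn = (q.unsqueeze(2) * keys).sum(dim=1) / (c ** 0.5)            # (B, k, H, W)
        attn = attn.softmax(dim=1)
        aggregated = (attn.unsqueeze(1) * values).sum(dim=2)              # (B, C, H, W)
        return feat_t + aggregated                                        # residual temporal aggregation
```

A toy call would be `block = SIFABlockSketch(256); out = block(feat_t, feat_t1)` with two (B, 256, H, W) feature maps. In this reading, SIFA replaces rigid temporal aggregation with attention over a small, motion-aware neighborhood, which is why regions with large deformation between frames are handled more gracefully; per the abstract, such blocks are plugged into ConvNet and Vision Transformer backbones to form SIFA-Net and SIFA-Transformer.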
Related papers
- Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
The video variational autoencoder (VAE) compresses pixel data into a low-dimensional latent space and plays an important role in models such as OpenAI's Sora.
Most existing video VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve the video VAE (IV-VAE).
arXiv Detail & Related papers (2024-11-10T12:43:38Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA uses the estimated optical flow as guidance to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both the DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens in a sequence, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - TAda! Temporally-Adaptive Convolutions for Video Understanding [17.24510667917993]
Adaptive weight calibration along the temporal dimension is an efficient way to facilitate modelling complex temporal dynamics in videos.
TAdaConv empowers spatial convolutions with temporal modelling abilities by calibrating the convolution weights for each frame according to its local and global temporal context; a minimal sketch of this idea appears after this list.
We construct TAda2D networks by replacing the spatial convolutions in ResNet with TAdaConv, which leads to on-par or better performance compared to state-of-the-art approaches on multiple video action recognition and localization benchmarks.
arXiv Detail & Related papers (2021-10-12T17:25:07Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification, built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
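For the TAda! entry above, here is the promised minimal sketch of temporally-adaptive weight calibration, under the assumption that a per-frame, per-channel factor derived from temporal context scales a shared 2D convolution. The names and the calibration branch (spatial pooling, a temporal convolution, and a sigmoid gate) are illustrative, not the official TAdaConv code.

```python
import torch
import torch.nn as nn


class TemporallyAdaptiveConvSketch(nn.Module):
    """A shared 2D convolution whose output channels are calibrated per frame."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.base = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        # Calibration branch (an assumption): spatially pooled features pass through a
        # temporal convolution, yielding one per-channel factor for every frame.
        self.calibrate = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        """x: (B, C, T, H, W) video features."""
        b, c, t, h, w = x.shape
        ctx = x.mean(dim=(3, 4))                    # (B, C, T) temporal context per frame
        alpha = torch.sigmoid(self.calibrate(ctx))  # (B, C, T) calibration factors

        out = []
        for i in range(t):
            y = self.base(x[:, :, i])               # shared spatial convolution
            # Scaling the output channels of a bias-free conv is equivalent to scaling
            # the output channels of its weight, i.e. a per-frame calibrated kernel.
            out.append(y * alpha[:, :, i, None, None])
        return torch.stack(out, dim=2)              # (B, C, T, H, W)
```

As in the entry above, such a layer could stand in for the spatial convolutions of a ResNet stage to give a TAda2D-style backbone; the pooling and gating choices here are placeholders for the context modelling described in the paper.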
This list is automatically generated from the titles and abstracts of the papers on this site.