Two-Stream Transformer Architecture for Long Video Understanding
- URL: http://arxiv.org/abs/2208.01753v1
- Date: Tue, 2 Aug 2022 21:03:48 GMT
- Authors: Edward Fish, Jon Weinbren, Andrew Gilbert
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pure vision transformer architectures are highly effective for short-video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource-intensive and data-inefficient. Long-form video understanding tasks amplify these data and memory efficiency problems, making current approaches infeasible in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data-efficient, and achieves SOTA performance on several long video understanding tasks.
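The quadratic-complexity problem the abstract cites can be made concrete with a short sketch. The snippet below implements a naive self-attention score matrix and compares a short clip against a two-minute video; the frame rate and per-frame token count are illustrative assumptions, not values from the paper.

```python
import numpy as np

def attention_scores(tokens: np.ndarray) -> np.ndarray:
    """Naive self-attention scores: compute and memory grow as O(T^2)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)  # (T, T) score matrix
    # numerically stable softmax over the last axis
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative token counts: 196 patch tokens per frame,
# a 16-frame short clip vs. a 2-minute video at 8 frames/s.
T_short = 16 * 196              # 3,136 tokens
T_long = 2 * 60 * 8 * 196      # 188,160 tokens
ratio = (T_long / T_short) ** 2  # quadratic blow-up of the score matrix
print(f"score matrix grows {ratio:.0f}x")  # prints "score matrix grows 3600x"
```

Under these assumptions the full attention matrix becomes 3600 times larger for the long video, which is why approaches that avoid dense frame-to-frame attention are needed for minutes-long inputs.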
Related papers
- Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
We propose a transformer-based video saliency prediction approach with high temporal dimension network decoding (THTDNet).
This architecture yields performance comparable to multi-branch and more complicated models on common benchmarks such as DHF1K, UCF-Sports, and Hollywood-2.
arXiv Detail & Related papers (2024-01-15T20:09:56Z)
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
We propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos.
Our method learns to decode the entire temporal volume of a video in parallel from partially observed patches.
arXiv Detail & Related papers (2023-03-20T16:35:38Z)
- Dual-path Adaptation from Image to Video Transformers
We efficiently transfer the representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- EgoViT: Pyramid Video Transformer for Egocentric Action Recognition
Capturing the interaction of hands with objects is important for autonomously detecting human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z)
- Reversible Vision Transformers
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely the Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than offset by the memory savings for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- VideoLightFormer: Lightweight Action Recognition using Transformers
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
- Long-Short Temporal Contrastive Learning of Video Transformers
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Space-time Mixing Attention for Video Transformer
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for temporal action proposal generation (TAPG).
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and differences between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
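Among the entries above, the memory saving behind Reversible Vision Transformers comes from reversible residual blocks: intermediate activations need not be stored, because each block's inputs can be recomputed exactly from its outputs during the backward pass. A minimal sketch of the idea follows; the toy sub-networks F and G are illustrative stand-ins, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.normal(size=(4, 4))
W_g = rng.normal(size=(4, 4))

def F(x):
    # Stand-in for an attention or MLP sub-block.
    return np.tanh(x @ W_f)

def G(x):
    # Second sub-block of the reversible pair.
    return np.tanh(x @ W_g)

def reversible_forward(x1, x2):
    """Two-stream reversible block: the outputs fully determine the inputs."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Recompute the inputs from the outputs -- no activations stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)  # exact reconstruction
```

Because the inverse reuses the same F and G evaluations in reverse order, reconstruction is exact, and the trade-off the entry describes is the extra compute of re-evaluating F and G during backpropagation in exchange for activation memory that stays constant in depth.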
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.