Real-time Online Video Detection with Temporal Smoothing Transformers
- URL: http://arxiv.org/abs/2209.09236v1
- Date: Mon, 19 Sep 2022 17:59:02 GMT
- Title: Real-time Online Video Detection with Temporal Smoothing Transformers
- Authors: Yue Zhao and Philipp Krähenbühl
- Abstract summary: A good streaming recognition model captures both long-term dynamics and short-term changes of video.
To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
- Score: 4.545986838009774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Streaming video recognition reasons about objects and their actions in every
frame of a video. A good streaming recognition model captures both long-term
dynamics and short-term changes of video. Unfortunately, in most existing
methods, the computational complexity grows linearly or quadratically with the
length of the considered dynamics. This issue is particularly pronounced in
transformer-based architectures. To address this issue, we reformulate the
cross-attention in a video transformer through the lens of kernels and apply two
kinds of temporal smoothing kernels: a box kernel and a Laplace kernel. The
resulting streaming attention reuses much of the computation from frame to
frame, and only requires a constant time update each frame. Based on this idea,
we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily
long inputs with constant caching and computing overhead. Specifically, it runs
$6\times$ faster than equivalent sliding-window based transformers with 2,048
frames in a streaming setting. Furthermore, thanks to the increased temporal
span, TeSTra achieves state-of-the-art results on THUMOS'14 and
EPIC-Kitchens-100, two standard online action detection and action anticipation
datasets. A real-time version of TeSTra outperforms all but one prior
approach on the THUMOS'14 dataset.
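The constant-time update follows directly from the form of the two smoothing kernels: a Laplace kernel $e^{-\lambda(t-i)}$ turns the smoothed feature into an exponential moving average with a one-step recursion, while a box kernel becomes a sliding-window mean maintained with a running sum and a bounded cache. The sketch below is a minimal, hypothetical illustration of that recursion over a stream of per-frame features, not the authors' TeSTra implementation: in TeSTra the temporal kernel is combined with learned cross-attention inside a transformer, which is omitted here, and all names (LaplaceStreamingSmoother, BoxStreamingSmoother, lam, window) are assumptions made for illustration.

```python
from collections import deque

import numpy as np


class LaplaceStreamingSmoother:
    """Exponentially decayed running average of per-frame features.

    Realizes y_t = sum_{i<=t} e^{-lam*(t-i)} v_i / sum_{i<=t} e^{-lam*(t-i)}
    with an O(1) recursive update instead of re-reading the whole history.
    """

    def __init__(self, dim: int, lam: float = 0.05):
        self.decay = float(np.exp(-lam))    # per-frame decay factor e^{-lam}
        self.weighted_sum = np.zeros(dim)   # running decayed sum of features
        self.weight_total = 0.0             # running decayed sum of weights

    def update(self, v: np.ndarray) -> np.ndarray:
        # Constant-time update: decay the cached sums, then fold in the new frame.
        self.weighted_sum = self.decay * self.weighted_sum + v
        self.weight_total = self.decay * self.weight_total + 1.0
        return self.weighted_sum / self.weight_total


class BoxStreamingSmoother:
    """Uniform average over the last `window` frames.

    Keeps a running sum plus a FIFO cache, so each update adds one frame
    and evicts at most one frame: constant time and constant memory.
    """

    def __init__(self, dim: int, window: int = 64):
        self.window = window
        self.cache: deque = deque()
        self.running_sum = np.zeros(dim)

    def update(self, v: np.ndarray) -> np.ndarray:
        self.cache.append(v)
        self.running_sum = self.running_sum + v
        if len(self.cache) > self.window:
            self.running_sum = self.running_sum - self.cache.popleft()
        return self.running_sum / len(self.cache)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    laplace = LaplaceStreamingSmoother(dim=256, lam=0.05)
    box = BoxStreamingSmoother(dim=256, window=64)
    for _ in range(2048):                        # arbitrarily long stream
        frame_feature = rng.standard_normal(256)
        y_laplace = laplace.update(frame_feature)
        y_box = box.update(frame_feature)
```

In either variant the per-frame cost and the cached state stay constant no matter how long the stream runs, which is what allows arbitrarily long inputs with constant caching and computing overhead.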
Related papers
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z) - TDViT: Temporal Dilated Video Transformer for Dense Video Tasks [35.16197118579414]
The Temporal Dilated Video Transformer (TDViT) can efficiently extract video representations and effectively alleviate the negative effect of temporal redundancy.
Experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation.
arXiv Detail & Related papers (2024-02-14T15:41:07Z) - SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z) - Towards End-to-End Generative Modeling of Long Videos with
Memory-Efficient Bidirectional Transformers [13.355338760884583]
We propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos.
Our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches.
arXiv Detail & Related papers (2023-03-20T16:35:38Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Long-term Leap Attention, Short-term Periodic Shift for Video
Classification [41.87505528859225]
A video transformer naturally incurs a heavier computation burden than a static vision transformer.
We propose LAPS, a long-term "Leap Attention" (LAN) and short-term "Periodic Shift" (P-Shift) module for video transformers.
arXiv Detail & Related papers (2022-07-12T13:30:15Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)