PatchBlender: A Motion Prior for Video Transformers
- URL: http://arxiv.org/abs/2211.14449v1
- Date: Fri, 11 Nov 2022 14:43:16 GMT
- Title: PatchBlender: A Motion Prior for Video Transformers
- Authors: Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel
Joshi, Sarath Chandar
- Abstract summary: We introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space.
We show that our method is successful at enabling vision transformers to encode the temporal component of video data.
- Score: 35.47505911122298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become one of the dominant architectures in the field of
computer vision. However, several challenges remain when applying such
architectures to video data. Most notably, these models struggle to model the
temporal patterns of video data effectively. Directly targeting this issue, we
introduce PatchBlender, a learnable blending function that operates over patch
embeddings across the temporal dimension of the latent space. We show that our
method is successful at enabling vision transformers to encode the temporal
component of video data. On Something-Something v2 and MOVi-A, we show that our
method improves the performance of a ViT-B. PatchBlender has the advantage of
being compatible with almost any Transformer architecture and since it is
learnable, the model can adaptively turn the prior on or off. It is also
extremely lightweight, costing only 0.005% of the GFLOPs of a ViT-B.
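The abstract describes PatchBlender only at a high level. As a hedged illustration, here is a minimal PyTorch sketch of one plausible reading of "a learnable blending function over patch embeddings across the temporal dimension": a learnable frames-by-frames mixing matrix applied uniformly to the latent patch grid. The module and variable names are invented for the example; this is not the paper's implementation.
```python
import torch
import torch.nn as nn

class TemporalPatchBlend(nn.Module):
    """Illustrative module (not the paper's code): blend each patch's
    embedding across frames with a learnable frames-x-frames matrix."""

    def __init__(self, num_frames: int):
        super().__init__()
        # Identity init biases each output frame toward its own input frame;
        # off-diagonal entries let the model learn how much to mix neighbors.
        self.blend = nn.Parameter(torch.eye(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim). Softmax rows form a convex
        # combination over frames, applied identically to every patch.
        weights = self.blend.softmax(dim=-1)          # (T, T)
        return torch.einsum("st,btnd->bsnd", weights, x)

# Usage: blend the latents of an 8-frame clip from a ViT-B-sized backbone.
latents = torch.randn(2, 8, 196, 768)                 # (B, T, N, D)
blended = TemporalPatchBlend(num_frames=8)(latents)
print(blended.shape)                                   # torch.Size([2, 8, 196, 768])
```
Because the blend matrix is learned, a near-identity solution effectively switches the prior off, which is consistent with the abstract's claim that the model can adaptively turn the prior on or off.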
Related papers
- TDViT: Temporal Dilated Video Transformer for Dense Video Tasks [35.16197118579414]
The Temporal Dilated Video Transformer (TDViT) can efficiently extract video representations and effectively alleviate the negative effect of temporal redundancy.
Experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation.
arXiv Detail & Related papers (2024-02-14T15:41:07Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
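The "lightweight bottleneck adapter" mentioned above is, generically, a small residual MLP inserted into a frozen transformer block so that only a few parameters are trained. A minimal sketch follows; the bottleneck width, placement, and zero-initialized up-projection are assumptions for illustration, not details taken from the paper.
```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. Zero-init makes it an identity at start."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 197, 768)          # (batch, tokens, dim) from a frozen ViT block
adapted = BottleneckAdapter(dim=768)(tokens)
print(adapted.shape)                        # torch.Size([2, 197, 768])
```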
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
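The memory saving comes from not storing intermediate activations: a reversible block can reconstruct its inputs exactly from its outputs, so activations are recomputed during the backward pass. The sketch below shows the standard reversible coupling in the style of RevNets, not the exact Reversible Vision Transformer block.
```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Reversible coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Inputs are recoverable from outputs, so activations need not be stored."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

dim = 768
block = ReversibleBlock(nn.Linear(dim, dim), nn.Linear(dim, dim))
x1, x2 = torch.randn(2, 196, dim), torch.randn(2, 196, dim)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1), torch.allclose(r2, x2))  # True True
```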
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - Patch-based Object-centric Transformers for Efficient Video Generation [71.55412580325743]
We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to the better compressibility of object-centric representations, we can improve training efficiency by allowing the model to access only object-level information over longer temporal horizons.
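The phrase "autoregressive transformer over the discrete latent space of compressed videos" refers to next-code prediction over tokenized video. The toy sketch below shows only that generic setup (a causally masked transformer over integer latent codes) and none of POVT's object-centric machinery; all sizes are placeholders.
```python
import torch
import torch.nn as nn

class LatentCodeAR(nn.Module):
    """Toy autoregressive model over discrete latent codes: predict the next
    code from all previous ones with a causally masked transformer."""

    def __init__(self, vocab: int = 1024, dim: int = 256, layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (B, L) integer codes from a video tokenizer (e.g. a VQ model).
        L = codes.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(codes), mask=causal)
        return self.head(h)                 # (B, L, vocab) next-code logits

codes = torch.randint(0, 1024, (2, 64))
print(LatentCodeAR()(codes).shape)          # torch.Size([2, 64, 1024])
```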
arXiv Detail & Related papers (2022-06-08T16:29:59Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently more powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
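For reference, the patch-embedding step in question is commonly implemented as a single strided convolution whose kernel size and stride both equal the patch size; the snippet below is a generic ViT-style sketch, not code from the paper.
```python
import torch
import torch.nn as nn

# Patch embedding: a Conv2d whose kernel and stride equal the patch size
# turns an image into a grid of patch tokens, one embedding per patch.
patch, dim = 16, 768
embed = nn.Conv2d(in_channels=3, out_channels=dim, kernel_size=patch, stride=patch)

image = torch.randn(1, 3, 224, 224)
tokens = embed(image)                        # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768)
print(tokens.shape)
```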
arXiv Detail & Related papers (2022-01-24T16:42:56Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
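"Matching the features of different views" admits many concrete losses; the snippet below is one generic stand-in (negative cosine similarity between embeddings of a global and a local view of the same clip), not the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def view_matching_loss(global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
    """Toy self-supervised objective: pull the features of two views of the
    same video together via negative cosine similarity."""
    g = F.normalize(global_feat, dim=-1)
    v = F.normalize(local_feat, dim=-1)
    return -(g * v).sum(dim=-1).mean()

# e.g. a global view (all frames, full frame) vs. a local view (few frames, crop)
global_feat = torch.randn(8, 768)
local_feat = torch.randn(8, 768)
print(view_matching_loss(global_feat, local_feat))
```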
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
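A factorized design of this kind typically runs a 2D backbone on each frame and a small transformer over the resulting frame features. The sketch below shows that generic pattern with a toy frame encoder; it should not be read as the VideoLightFormer architecture itself, and all sizes are placeholders.
```python
import torch
import torch.nn as nn

class FactorizedVideoModel(nn.Module):
    """Schematic factorized model: per-frame 2D features, then a temporal
    transformer over the frame sequence, then a clip-level classifier."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 100):
        super().__init__()
        self.frame_encoder = nn.Sequential(    # toy stand-in for a 2D backbone
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = video.shape
        frames = self.frame_encoder(video.flatten(0, 1)).view(B, T, -1)
        return self.head(self.temporal(frames).mean(dim=1))

clip = torch.randn(2, 8, 3, 112, 112)        # (batch, frames, channels, H, W)
print(FactorizedVideoModel()(clip).shape)    # torch.Size([2, 100])
```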
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
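One generic way to make attention cost scale linearly with the number of frames is to restrict each frame's temporal attention to a fixed local window, sketched below with a single token per frame for brevity; the paper's actual space-time mixing scheme is more involved.
```python
import torch

def local_temporal_attention(q, k, v, window: int = 1):
    """Toy attention where each frame attends only to frames within
    +/- `window`, so total cost grows linearly with the number of frames.
    q, k, v: (batch, frames, dim), one token per frame for illustration."""
    B, T, D = q.shape
    out = torch.zeros_like(q)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        scores = q[:, t : t + 1] @ k[:, lo:hi].transpose(1, 2) / D ** 0.5
        out[:, t] = (scores.softmax(dim=-1) @ v[:, lo:hi]).squeeze(1)
    return out

q = k = v = torch.randn(2, 16, 64)
print(local_temporal_attention(q, k, v, window=1).shape)  # torch.Size([2, 16, 64])
```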
arXiv Detail & Related papers (2021-06-10T17:59:14Z)