An Image is Worth 16x16 Words, What is a Video Worth?
- URL: http://arxiv.org/abs/2103.13915v1
- Date: Thu, 25 Mar 2021 15:25:17 GMT
- Title: An Image is Worth 16x16 Words, What is a Video Worth?
- Authors: Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
- Abstract summary: Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames.
Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video.
We address the computational bottleneck by significantly reducing the number of frames required for inference.
- Score: 14.056790511123866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leading methods in the domain of action recognition try to distill
information from both the spatial and temporal dimensions of an input video.
frames. The use of such convolutions requires sampling short clips from the
convolution layers as a way to abstract the temporal information from video
frames. The use of such convolutions requires sampling short clips from the
input video, where each clip is a collection of closely sampled frames. Since
each short clip covers a small fraction of an input video, multiple clips are
sampled at inference in order to cover the whole temporal length of the video.
This leads to increased computational load and is impractical for real-world
applications. We address the computational bottleneck by significantly reducing
the number of frames required for inference. Our approach relies on a temporal
transformer that applies global attention over video frames, and thus better
exploits the salient information in each frame. Our approach is therefore highly
input-efficient, and can achieve SotA results (on the Kinetics dataset) with a
fraction of the data (frames per video), computation, and latency. Specifically,
on Kinetics-400 we reach 78.8 top-1 accuracy with $\times 30$ fewer frames per
video and $\times 40$ faster inference than the current leading method. Code
is available at: https://github.com/Alibaba-MIIL/STAM
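A minimal sketch of the idea described above, assuming PyTorch and torchvision: each sampled frame is embedded with a 2D backbone, and a temporal transformer then applies global attention across the resulting frame tokens to produce a video-level prediction. This is an illustration only, not the official STAM implementation from the repository above; the module name `TemporalAttentionClassifier` and all hyperparameters are assumptions.

```python
# Sketch of global temporal attention over per-frame embeddings
# (illustrative, not the official STAM code).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TemporalAttentionClassifier(nn.Module):
    def __init__(self, num_classes=400, num_frames=16, dim=2048, depth=6, heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                 # keep the 2048-d per-frame embedding
        self.backbone = backbone
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                       # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                # (B*T, C, H, W)
        tokens = self.backbone(frames).view(b, t, -1)   # one token per frame
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.temporal_pos[:, : t + 1]
        x = self.temporal_transformer(x)            # global attention over all frames
        return self.head(x[:, 0])                   # classify from the class token

# A handful of frames sampled across the whole video, rather than many dense clips.
logits = TemporalAttentionClassifier()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 400])
```

Because attention is global over the whole sampled sequence, a few frames spread across the video can stand in for the many densely sampled clips that 3D-convolutional pipelines require at inference time.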
Related papers
- Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
Transformer-based Video Question Answering [14.659023742381777]
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question.
We present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we arrange video frames into an $n\times n$ matrix and then convert it into one image.
arXiv Detail & Related papers (2023-05-16T02:12:57Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video
Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion
Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
arXiv Detail & Related papers (2022-04-21T06:17:05Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [77.52828273633646]
We present a new drop-in block for video transformers that aggregates information along implicitly determined motion paths.
We also propose a new method to address the quadratic dependence of computation and memory on the input size.
We obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets.
arXiv Detail & Related papers (2021-06-09T21:16:05Z)
- Video Instance Segmentation using Inter-Frame Communication Transformers [28.539742250704695]
Recently, per-clip pipelines have shown superior performance over per-frame methods.
Previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications.
We propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames.
arXiv Detail & Related papers (2021-06-07T02:08:39Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
arXiv Detail & Related papers (2021-03-29T07:44:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.