Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
Transformer-based Video Question Answering
- URL: http://arxiv.org/abs/2305.09107v1
- Date: Tue, 16 May 2023 02:12:57 GMT
- Title: Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
Transformer-based Video Question Answering
- Authors: Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster
- Abstract summary: Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question.
We present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we concatenate video frames into an $n\times n$ matrix and then convert it to one image.
- Score: 14.659023742381777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional Transformer-based Video Question Answering (VideoQA) approaches
generally encode frames independently through one or more image encoders
followed by interaction between frames and question. However, such a schema
incurs significant memory use and inevitably slows down training and
inference. In this work, we present a highly efficient approach for
VideoQA based on existing vision-language pre-trained models where we
concatenate video frames into an $n\times n$ matrix and then convert it to one
image. By doing so, we reduce the use of the image encoder from $n^{2}$ to $1$
while maintaining the temporal structure of the original video. Experimental
results on MSRVTT and TrafficQA show that our proposed approach achieves
state-of-the-art performance with nearly $4\times$ faster speed and only 30% of the
memory use. We show that by integrating our approach into VideoQA systems we
can achieve comparable, even superior, performance with a significant speed up
for training and inference. We believe the proposed approach can facilitate
VideoQA-related research by reducing the computational requirements for those
who have limited access to budgets and resources. Our code will be made
publicly available for research use.
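As a rough illustration of the frame-to-grid step described above, the sketch below tiles $n\times n$ uniformly sampled frames into a single image that can then be passed once through an existing image encoder, instead of encoding each frame separately. The function and parameter names (`frames_to_grid_image`, `frame_size`) are illustrative assumptions, not the authors' released code.

```python
from PIL import Image

def frames_to_grid_image(frames, n, frame_size=224):
    """Tile the first n*n frames (PIL Images) into one n x n grid image.

    Frames are placed in row-major order so the temporal order of the
    video is preserved left-to-right, top-to-bottom. All names and sizes
    here are illustrative assumptions.
    """
    assert len(frames) >= n * n, "need at least n*n sampled frames"
    grid = Image.new("RGB", (n * frame_size, n * frame_size))
    for idx in range(n * n):
        frame = frames[idx].resize((frame_size, frame_size))
        row, col = divmod(idx, n)  # row-major placement keeps temporal order
        grid.paste(frame, (col * frame_size, row * frame_size))
    return grid
```

With $n=3$, for example, nine sampled frames become one $672\times 672$ image, so the image encoder of the underlying vision-language model runs once per video rather than nine times.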
Related papers
- Fast Encoding and Decoding for Implicit Video Representation [88.43612845776265]
We introduce NeRV-Enc, a transformer-based hyper-network for fast encoding; and NeRV-Dec, a parallel decoder for efficient video loading.
NeRV-Enc achieves an impressive speed-up of $\mathbf{104\times}$ by eliminating gradient-based optimization.
NeRV-Dec simplifies video decoding, outperforming conventional codecs with a loading speed $\mathbf{11\times}$ faster.
arXiv Detail & Related papers (2024-09-28T18:21:52Z)
- FlashVideo: A Framework for Swift Inference in Text-to-Video Generation [9.665089218030086]
This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation.
FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed.
Our comprehensive experiments demonstrate that FlashVideo achieves a $9.17\times$ improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
arXiv Detail & Related papers (2023-12-30T00:06:28Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval [2.303098021872002]
We propose an efficient and high-performance method for partially relevant video retrieval.
It aims to retrieve long videos that contain at least one moment relevant to the input text query.
arXiv Detail & Related papers (2023-12-01T08:38:27Z)
- Predictive Coding For Animation-Based Video Compression [13.161311799049978]
We propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame.
Our experiments indicate a significant gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC.
arXiv Detail & Related papers (2023-07-09T14:40:54Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [74.35009770905968]
We build a memory-augmented vision transformer that has a temporal support $30\times$ longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
arXiv Detail & Related papers (2022-01-20T18:59:54Z)
- Conditional Entropy Coding for Efficient Video Compression [82.35389813794372]
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames.
We first show that a simple architecture modeling the entropy between the image latent codes is as competitive as other neural video compression works and video codecs.
We then propose a novel internal learning extension on top of this architecture that brings an additional 10% savings without trading off decoding speed.
arXiv Detail & Related papers (2020-08-20T20:01:59Z)