TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language
Understanding
- URL: http://arxiv.org/abs/2310.19060v1
- Date: Sun, 29 Oct 2023 16:25:32 GMT
- Title: TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language
Understanding
- Authors: Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
- Abstract summary: TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video-language pre-training has made remarkable strides in
advancing video-language understanding tasks. However, the heavy computational
burden of video encoding remains a formidable efficiency bottleneck,
particularly for long-form videos. These videos contain massive visual tokens
due to their inherent 3D properties and spatiotemporal redundancy, making it
challenging to capture complex temporal and spatial relationships. To tackle
this issue, we propose an efficient method called TEmporal-Spatial Token
Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating
similar frames, as well as similar patches within each frame. TESTA can reduce
the number of visual tokens by 75% and thus accelerate video encoding. Building
upon TESTA, we introduce a pre-trained video-language model equipped with a
divided space-time token aggregation module in each video encoder block. We
evaluate our model on five datasets for paragraph-to-video retrieval and
long-form VideoQA tasks. Experimental results show that TESTA improves
computing efficiency by 1.7 times, and achieves significant performance gains
from its scalability in processing longer input frames, e.g., +13.7 R@1 on
QuerYD and +6.5 R@1 on Condensed Movie.
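To make the aggregation idea concrete, below is a minimal NumPy sketch of similarity-based token merging: the most similar adjacent tokens are averaged until only 25% remain, mirroring the 75% reduction reported above. The greedy adjacent-pair rule and the `keep_ratio` parameter are illustrative assumptions, not TESTA's exact algorithm, which uses a divided space-time aggregation module inside each encoder block.
```python
import numpy as np

def aggregate_tokens(tokens: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Greedily average the most similar adjacent token pair until only
    keep_ratio of the tokens remain (0.25 mirrors the 75% reduction)."""
    toks = list(tokens)  # list of (D,) feature vectors
    target = max(1, int(len(toks) * keep_ratio))
    while len(toks) > target:
        sims = [
            float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(toks, toks[1:])  # cosine sim of adjacent pairs
        ]
        i = int(np.argmax(sims))                          # most redundant pair
        toks[i : i + 2] = [(toks[i] + toks[i + 1]) / 2]   # merge by averaging
    return np.stack(toks)

# Toy usage: 16 frame-level tokens with 64-dim features -> 4 aggregated tokens.
frames = np.random.randn(16, 64).astype(np.float32)
print(aggregate_tokens(frames).shape)  # (4, 64)
```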
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
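The frame-removal step can be sketched as follows, assuming per-frame embeddings (e.g. DINOv2 CLS features) have already been computed; the 0.9 cosine threshold and the compare-to-last-kept-frame rule are assumptions for illustration, not LongVU's exact procedure.
```python
import numpy as np

def prune_redundant_frames(feats: np.ndarray, sim_threshold: float = 0.9):
    """Keep a frame only if its embedding differs enough from the last
    kept frame. `feats` holds one embedding per frame (e.g. DINOv2
    features); returns the indices of the retained frames."""
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    kept = [0]
    for i in range(1, len(feats)):
        if feats[i] @ feats[kept[-1]] < sim_threshold:  # dissimilar enough
            kept.append(i)
    return kept

feats = np.random.randn(32, 384).astype(np.float32)  # 32 frames
print(prune_redundant_frames(feats))
```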
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
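The decomposition can be pictured with a toy stand-in: split each clip into a keyframe that carries appearance and a motion stream that carries dynamics. Plain frame differences serve as the motion stream here purely for illustration; the actual model tokenizes keyframes and motion information with learned tokenizers.
```python
import numpy as np

def decompose_clip(clip: np.ndarray):
    """Toy decomposition of a clip (T, H, W, C) into one keyframe plus a
    motion stream; frame differences stand in for real motion tokens."""
    keyframe = clip[0]
    motion = np.diff(clip.astype(np.float32), axis=0)  # (T-1, H, W, C)
    return keyframe, motion

clip = np.random.rand(8, 32, 32, 3).astype(np.float32)
key, motion = decompose_clip(clip)
print(key.shape, motion.shape)  # (32, 32, 3) (7, 32, 32, 3)
```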
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
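Since the summary describes adjusting the backbone's input features, here is a generic visual-prompting sketch in PyTorch: learnable prompt tokens are prepended to each frame's patch tokens before they enter a (frozen) backbone. The module name, prompt count, and dimensions are assumptions, not STPN's exact architecture.
```python
import torch
import torch.nn as nn

class PromptedBackboneInput(nn.Module):
    """Prepend a small set of learnable prompt tokens to the patch
    tokens of each frame, leaving the backbone itself untouched."""
    def __init__(self, dim: int = 256, num_prompts: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompts, patch_tokens], dim=1)

x = torch.randn(2, 196, 256)              # two frames of 14x14 patch tokens
print(PromptedBackboneInput()(x).shape)   # torch.Size([2, 204, 256])
```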
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
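The "shallow temporal fusion" paradigm the authors start from can be sketched as follows: a frozen image-text encoder produces one embedding per frame, and a small temporal transformer fuses them into a single video embedding. Layer sizes and mean-pooling are illustrative assumptions.
```python
import torch
import torch.nn as nn

class ShallowTemporalFusion(nn.Module):
    """Fuse per-frame embeddings from a (frozen) image-text encoder with a
    small temporal transformer, then mean-pool into one video embedding."""
    def __init__(self, dim: int = 512, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_embs: torch.Tensor) -> torch.Tensor:
        # frame_embs: (batch, num_frames, dim), one vector per frame
        return self.temporal(frame_embs).mean(dim=1)  # (batch, dim)

video = ShallowTemporalFusion()(torch.randn(2, 32, 512))  # 32 frames > 16
print(video.shape)  # torch.Size([2, 512])
```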
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
arXiv Detail & Related papers (2023-12-04T17:09:52Z)
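A hedged sketch of the timestamp-binding idea: a sinusoidal encoding of each frame's timestamp is appended to its visual features so later layers can localize content in time. The encoding size and frequency schedule are assumptions; TimeChat's actual frame encoder fuses timestamps differently.
```python
import torch

def bind_timestamps(frame_feats: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """Append a sinusoidal encoding of each frame's timestamp (seconds)
    to its visual features, grounding the features in time."""
    half = 16  # size of the timestamp encoding (assumption)
    freqs = torch.exp(torch.arange(half) * -0.5)            # (half,)
    angles = timestamps.unsqueeze(1) * freqs.unsqueeze(0)   # (T, half)
    time_enc = torch.cat([angles.sin(), angles.cos()], dim=1)
    return torch.cat([frame_feats, time_enc], dim=1)        # (T, dim + 2*half)

feats = torch.randn(8, 256)
stamps = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
print(bind_timestamps(feats, stamps).shape)  # torch.Size([8, 288])
```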
- SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z)
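The two sparsity forms can be sketched directly; the local attention window and the attention-score pruning criterion below are illustrative stand-ins for SViTT's learned sparsity patterns.
```python
import torch

def sparse_attention_mask(num_tokens: int, window: int = 4) -> torch.Tensor:
    """Edge sparsity: each query may only attend to keys within a local
    window, shrinking query-key communication from O(N^2) to O(N*window)."""
    idx = torch.arange(num_tokens)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window  # (N, N) bool

def prune_tokens(tokens: torch.Tensor, attn_scores: torch.Tensor, keep: int):
    """Node sparsity: keep only the `keep` tokens receiving the most
    attention, discarding uninformative ones."""
    top = attn_scores.topk(keep).indices.sort().values
    return tokens[top]

tokens = torch.randn(64, 128)
scores = torch.rand(64)  # e.g. column sums of an attention map
print(sparse_attention_mask(64).float().mean())     # fraction of allowed edges
print(prune_tokens(tokens, scores, keep=16).shape)  # torch.Size([16, 128])
```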
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
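The contrastive objective can be sketched as a symmetric InfoNCE loss between clip embeddings and the sentence embeddings of the paired paragraph. The one-to-one clip-sentence pairing and the temperature value are simplifying assumptions about the paper's multimodal temporal contrastive loss.
```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_embs: torch.Tensor,
                              sent_embs: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between temporally ordered clip embeddings and
    paragraph sentence embeddings: clip i should match sentence i and
    repel every other sentence."""
    clip_embs = F.normalize(clip_embs, dim=1)
    sent_embs = F.normalize(sent_embs, dim=1)
    logits = clip_embs @ sent_embs.T / temperature  # (N, N) similarities
    targets = torch.arange(len(clip_embs))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = temporal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```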
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to learning better video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
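Frame selection in this spirit can be sketched as scoring each frame embedding against the text embedding and keeping the top-k in temporal order; the `keep` budget and cosine scoring are assumptions for illustration, not FineCo's exact sampler.
```python
import torch
import torch.nn.functional as F

def sample_relevant_frames(frame_embs: torch.Tensor,
                           text_emb: torch.Tensor,
                           keep: int = 4) -> torch.Tensor:
    """Score every frame against the text embedding and keep the `keep`
    most relevant frames, distilling the video down to the content the
    caption actually describes."""
    frame_embs = F.normalize(frame_embs, dim=1)
    text_emb = F.normalize(text_emb, dim=0)
    scores = frame_embs @ text_emb                  # (T,) similarities
    top = scores.topk(keep).indices.sort().values   # preserve time order
    return frame_embs[top]

frames = torch.randn(32, 512)  # 32 candidate frame embeddings
text = torch.randn(512)
print(sample_relevant_frames(frames, text).shape)  # torch.Size([4, 512])
```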