A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- URL: http://arxiv.org/abs/2312.07395v1
- Date: Tue, 12 Dec 2023 16:10:19 GMT
- Title: A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- Authors: Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe
Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman,
  Aida Nematzadeh
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding long, real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures, building on
the common paradigm of transferring large-scale, image--text models to video
via shallow temporal fusion. However, we expose two limitations to the
approach: (1) decreased spatial capabilities, likely due to poor
video--language alignment in standard video datasets, and (2) higher memory
consumption, bottlenecking the number of frames that can be processed. To
mitigate the memory bottleneck, we systematically analyze the memory/accuracy
trade-off of various efficient methods: factorized attention,
parameter-efficient image-to-video adaptation, input masking, and
multi-resolution patchification. Surprisingly, simply masking large portions of
the video (up to 75%) during contrastive pre-training proves to be one of the
most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our
simple approach for training long video-to-text models, which scales to 1B
parameters, does not add new architectural complexity and is able to outperform
the popular paradigm of using much larger LLMs as an information aggregator
over segment-based information on benchmarks with long-range temporal
dependencies (YouCook2, EgoSchema).
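As a rough illustration of the two ingredients the abstract highlights, heavy input masking and a contrastive video-text objective, the sketch below drops 75% of a video's patch tokens before encoding and computes a symmetric InfoNCE loss over a batch of paired embeddings. All names, shapes, and the use of NumPy in place of a real encoder are illustrative assumptions, not the paper's code.

```python
import numpy as np

def mask_video_tokens(tokens, mask_ratio=0.75, seed=0):
    """Randomly keep a (1 - mask_ratio) fraction of patch tokens.

    tokens: (N, D) array of patch embeddings for one video,
    where N = num_frames * patches_per_frame (hypothetical layout).
    Returns the kept tokens and their (sorted) indices.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep = rng.choice(n, size=n_keep, replace=False)
    keep.sort()
    return tokens[keep], keep

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for paired video/text embeddings.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matched pairs lie on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the paper's setting, the kept tokens would be fed to the video encoder and pooled into `video_emb`; the point of the recipe is that masking shrinks the encoder's input (and memory) by 4x at a 75% ratio, which is what allows scaling to minutes-long clips.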
Related papers
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling).
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose TTVSR, a novel trajectory-aware Transformer for video super-resolution.
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.