LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders
- URL: http://arxiv.org/abs/2504.03501v1
- Date: Fri, 04 Apr 2025 14:56:27 GMT
- Title: LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders
- Authors: Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni
- Abstract summary: We introduce long-video masked-embedding autoencoders (LV-MAE). LV-MAE is a self-supervised learning framework for long video representation. We achieve state-of-the-art results on three long-video benchmarks.
- Score: 9.996331443220651
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minute video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks -- LVU, COIN, and Breakfast -- employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval.
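The masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the mask ratio, embedding size, loss, or model architecture, so the function names (`split_mask`, `masked_mse`), the 0.5 mask ratio, the mean-squared-error objective, and the random toy embeddings below are all assumptions chosen for illustration. The idea shown is only that a long video becomes a sequence of per-segment embeddings, some positions are hidden, and a reconstruction error is measured on the hidden positions.

```python
import random


def split_mask(num_segments, mask_ratio=0.5, seed=0):
    """Randomly partition segment positions into visible and masked index lists.

    mask_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in (0, 1)")
    rng = random.Random(seed)
    order = list(range(num_segments))
    rng.shuffle(order)
    num_masked = max(1, int(round(num_segments * mask_ratio)))
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(predicted, target, masked):
    """Mean squared error computed only over the masked segment embeddings."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy stand-in for frozen short-segment embeddings: 8 segments, 4 dims each.
# In the setting the abstract describes, these would come from an
# off-the-shelf multimodal encoder applied to each short segment.
toy_rng = random.Random(1)
embeddings = [[toy_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]

visible, masked = split_mask(len(embeddings), mask_ratio=0.5)

# Trivial baseline "decoder": predict the mean of the visible embeddings
# at every position. A trained autoencoder would replace this step.
mean_visible = [
    sum(embeddings[i][d] for i in visible) / len(visible) for d in range(4)
]
predicted = [mean_visible for _ in embeddings]

loss = masked_mse(predicted, embeddings, masked)
```

Because the loss is taken only over masked positions, a model trained this way would have to infer hidden segments from their visible neighbours, which is one plausible reading of how cross-segment dependencies are learned.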
Related papers
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation [19.616624959353697]
$\infty$-Video can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism.
Our framework augments video Q-formers by allowing them to process video contexts efficiently and without requiring additional training.
arXiv Detail & Related papers (2025-01-31T12:45:46Z) - Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.
Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues.
To facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM.
arXiv Detail & Related papers (2024-12-04T00:50:33Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)