TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
- URL: http://arxiv.org/abs/2011.11479v3
- Date: Tue, 17 Aug 2021 17:47:31 GMT
- Title: TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
- Authors: Humam Alwassel, Silvio Giancola, Bernard Ghanem
- Abstract summary: We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
- Score: 79.01176229586855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the large memory footprint of untrimmed videos, current
state-of-the-art video localization methods operate atop precomputed video clip
features. These features are extracted from video encoders typically trained
for trimmed action classification tasks, making such features not necessarily
suitable for temporal localization. In this work, we propose a novel supervised
pretraining paradigm for clip features that not only trains to classify
activities but also considers background clips and global video information to
improve temporal sensitivity. Extensive experiments show that using features
trained with our novel pretraining strategy significantly improves the
performance of recent state-of-the-art methods on three tasks: Temporal Action
Localization, Action Proposal Generation, and Dense Video Captioning. We also
show that our pretraining approach is effective across three encoder
architectures and two pretraining datasets. We believe video feature encoding
is an important building block for localization algorithms, and extracting
temporally-sensitive features should be of paramount importance in building
more accurate models. The code and pretrained models are available on our
project website.
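As a rough illustration of the pretraining objective described in the abstract, the PyTorch-style sketch below trains a clip encoder with two heads: an action classifier supervised only on foreground clips, and a foreground/background (temporal region) classifier supervised on all clips, with each clip feature concatenated to a pooled global video feature (GVF). The encoder choice, head shapes, pooling, and loss weight are illustrative assumptions, not the authors' released implementation (which is available on the project website).

```python
# Minimal sketch of a TSP-style temporally-sensitive pretraining step (PyTorch).
# Encoder, head sizes, GVF pooling, and the loss weight `alpha` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSPModel(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = encoder  # any clip-level video encoder producing (B, feat_dim) features
        # Each clip feature is concatenated with a global video feature (GVF) of equal size.
        self.action_head = nn.Linear(2 * feat_dim, num_classes)  # activity label (foreground clips)
        self.region_head = nn.Linear(2 * feat_dim, 2)             # inside vs. outside an action

    def forward(self, clips, video_clips):
        f = self.encoder(clips)                                   # (B, feat_dim)
        with torch.no_grad():
            # GVF: pool features of clips sampled across the whole untrimmed video.
            gvf = self.encoder(video_clips).max(dim=0).values     # (feat_dim,)
        f = torch.cat([f, gvf.expand(f.size(0), -1)], dim=1)      # (B, 2 * feat_dim)
        return self.action_head(f), self.region_head(f)

def tsp_loss(action_logits, region_logits, action_labels, is_foreground, alpha=1.0):
    """Action classification on foreground clips only; region classification on all clips."""
    fg = is_foreground.bool()
    loss = F.cross_entropy(region_logits, is_foreground.long())
    if fg.any():
        loss = loss + alpha * F.cross_entropy(action_logits[fg], action_labels[fg])
    return loss
```

Training this way keeps the standard activity classification signal while forcing the encoder to separate action clips from background within the same untrimmed video, which is what the abstract refers to as improved temporal sensitivity.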
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding [40.60371529725805]
We propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation.
We introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of VLM for better discovering action-sensitive patterns.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z)
- Contrastive Language-Action Pre-training for Temporal Localization [64.34349213254312]
Long-form video understanding requires approaches that are able to temporally localize activities or language.
End-to-end training for such tasks is limited by compute and annotation constraints; these limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations.
We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions.
arXiv Detail & Related papers (2022-04-26T13:17:50Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
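The paste-and-align idea described above can be sketched briefly. The PyTorch-style code below is an illustration only: the region length, the assumption that the encoder returns per-timestep features, and the cosine-agreement loss are choices made for clarity, not the paper's exact recipe.

```python
# Illustrative sketch of a PAL-style pseudo-action pretext task.
# Assumes `encoder(video)` returns per-timestep features of shape (B, T, D).
import torch
import torch.nn.functional as F

def paste_pseudo_action(background, action_clip, start):
    """Overwrite a temporal span of `background` (C, T, H, W) with `action_clip` (C, t, H, W)."""
    out = background.clone()
    t = action_clip.shape[1]
    out[:, start:start + t] = action_clip
    return out, (start, start + t)

def pal_agreement_loss(encoder, video_a, video_b, source_video, region_len=16):
    # 1) Sample a pseudo action region from the source video.
    t0 = torch.randint(0, source_video.shape[1] - region_len + 1, (1,)).item()
    pseudo_action = source_video[:, t0:t0 + region_len]
    # 2) Paste it at different temporal positions in two other videos.
    sa = torch.randint(0, video_a.shape[1] - region_len + 1, (1,)).item()
    sb = torch.randint(0, video_b.shape[1] - region_len + 1, (1,)).item()
    synth_a, span_a = paste_pseudo_action(video_a, pseudo_action, sa)
    synth_b, span_b = paste_pseudo_action(video_b, pseudo_action, sb)
    # 3) Pool features over the pasted spans and maximize their agreement.
    fa = encoder(synth_a.unsqueeze(0))[:, span_a[0]:span_a[1]].mean(dim=1)
    fb = encoder(synth_b.unsqueeze(0))[:, span_b[0]:span_b[1]].mean(dim=1)
    return 1 - F.cosine_similarity(fa, fb).mean()
```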
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
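A minimal sketch of this kind of design, assuming precomputed per-frame features and a small trainable temporal Transformer; the feature dimension, depth, and mean-pooling readout are illustrative, and the paper's actual prompting mechanism is not reproduced here.

```python
# Hypothetical sketch: a lightweight temporal Transformer over frame-wise features
# (e.g., from a frozen image-text encoder). Dimensions and depth are illustrative.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):        # (B, T, feat_dim) precomputed per-frame features
        x = self.temporal(frame_feats)     # lightweight temporal reasoning only
        return x.mean(dim=1)               # (B, feat_dim) video-level embedding

# Usage: only the small temporal head is trained; the image encoder stays frozen.
head = TemporalHead()
video_embedding = head(torch.randn(4, 16, 512))  # 4 videos, 16 frames each
```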
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.