Contrastive Language-Action Pre-training for Temporal Localization
- URL: http://arxiv.org/abs/2204.12293v1
- Date: Tue, 26 Apr 2022 13:17:50 GMT
- Title: Contrastive Language-Action Pre-training for Temporal Localization
- Authors: Mengmeng Xu, Erhan Gundogdu, Maksim Lapin, Bernard Ghanem, Michael
Donoser, Loris Bazzani
- Abstract summary: Long-form video understanding requires approaches that can temporally localize activities or language queries.
End-to-end training for such tasks is limited by memory constraints and the lack of large-scale temporal annotations; these limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations.
We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions.
- Score: 64.34349213254312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form video understanding requires designing approaches that are able to
temporally localize activities or language queries. End-to-end training for such tasks
is limited by compute-device memory constraints and the lack of large-scale temporal
annotations. These limitations can be addressed by pre-training
on large datasets of temporally trimmed videos supervised by class annotations.
Once the video encoder is pre-trained, it is common practice to freeze it
during fine-tuning. Therefore, the video encoder does not learn temporal
boundaries and unseen classes, causing a domain gap with respect to the
downstream tasks. Moreover, using temporally trimmed videos prevents the model from
capturing the relations between different action categories and the background context
in a video clip, which limits generalization capacity. To address these
limitations, we propose a novel language-driven post-pre-training approach that does
not freeze the video encoder. We introduce a masked contrastive
learning loss to capture visio-linguistic relations between activities,
background video clips and language in the form of captions. Our experiments
show that the proposed approach improves the state-of-the-art on temporal
action localization, few-shot temporal action localization, and video language
grounding tasks.
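To make the objective concrete, below is a minimal PyTorch sketch of a masked, InfoNCE-style contrastive loss between clip embeddings and caption embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name, the masking rule (excluding pairs flagged as invalid negatives, e.g. background clips), and the temperature value are all hypothetical.

```python
# Sketch of a masked contrastive loss between video-clip and caption embeddings.
# Assumptions: N matched clip-caption pairs, a boolean mask selecting which
# off-diagonal pairs may serve as negatives. Illustrative only.
import torch
import torch.nn.functional as F


def masked_contrastive_loss(clip_emb, text_emb, mask, temperature=0.07):
    """clip_emb, text_emb: (N, D) embeddings for N clip-caption pairs.
    mask: (N, N) boolean, True where a pair may be used as a negative.
    The diagonal (matched pairs) is always treated as the positive."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every clip and every caption.
    logits = clip_emb @ text_emb.t() / temperature

    # Keep positives on the diagonal, suppress disallowed negatives.
    neg_inf = torch.finfo(logits.dtype).min
    allowed = mask | torch.eye(len(logits), dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(~allowed, neg_inf)

    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric loss: clip-to-text and text-to-clip directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random features stand in for video-encoder and text-encoder outputs.
    N, D = 8, 256
    clips, captions = torch.randn(N, D), torch.randn(N, D)
    mask = torch.ones(N, N, dtype=torch.bool)  # allow all negatives here
    print(masked_contrastive_loss(clips, captions, mask).item())
```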
Related papers
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time-consuming and costly collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be conducted simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)