Boundary-sensitive Pre-training for Temporal Localization in Videos
- URL: http://arxiv.org/abs/2011.10830v3
- Date: Fri, 26 Mar 2021 11:01:35 GMT
- Title: Boundary-sensitive Pre-training for Temporal Localization in Videos
- Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Victor Escorcia, Brais Martinez,
Xiatian Zhu, Li Zhang, Bernard Ghanem, Tao Xiang
- Abstract summary: We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be conducted simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action-classification-based pre-training counterpart.
- Score: 124.40788524169668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many video analysis tasks require temporal localization, and thus the
detection of content changes. However, most existing models developed for these
tasks are pre-trained on general video action classification tasks. This is
because large-scale annotation of temporal boundaries in untrimmed videos is
expensive, so no suitable datasets exist for temporal boundary-sensitive
pre-training. In this paper, for the first time, we investigate model
pre-training for temporal localization by introducing a novel
boundary-sensitive pretext (BSP) task. Instead of relying on costly manual
annotations of temporal boundaries, we propose to synthesize temporal
boundaries in existing video action classification datasets. With the
synthesized boundaries, BSP can be conducted simply by classifying the
boundary types. This enables the learning of video representations that are
much more transferable to downstream temporal localization tasks. Extensive
experiments show that the proposed BSP is superior and complementary to the
existing action-classification-based pre-training counterpart, and achieves
new state-of-the-art performance on several temporal localization tasks.
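To make the pretext task concrete, below is a minimal sketch of how temporal boundaries could be synthesized from a trimmed action classification dataset. The three-way boundary taxonomy and all names here are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch

# Illustrative boundary taxonomy (an assumption; the paper's exact set of
# boundary types may differ):
#   0 = no boundary (a single continuous clip)
#   1 = boundary between two clips of the same action class
#   2 = boundary between two clips of different action classes
NUM_BOUNDARY_TYPES = 3

def synthesize_sample(clips, labels):
    """Build one synthetic pre-training sample from trimmed clips.

    clips:  (N, C, T, H, W) trimmed action-recognition clips
    labels: (N,) action class id per clip
    Returns a (C, 2T, H, W) video whose mid-point either is or is not an
    artificial boundary, together with its boundary-type label."""
    n = clips.size(0)
    kind = torch.randint(0, NUM_BOUNDARY_TYPES, ()).item()
    i = torch.randint(0, n, ()).item()
    if kind == 0:
        # No boundary: slow a single clip down to double length.
        video = clips[i].repeat_interleave(2, dim=1)
    else:
        want_same = kind == 1
        cands = [j for j in range(n)
                 if j != i and bool(labels[j] == labels[i]) == want_same]
        # Fall back to the same clip if no suitable partner exists.
        j = cands[torch.randint(0, len(cands), ()).item()] if cands else i
        video = torch.cat([clips[i], clips[j]], dim=1)  # concatenate in time
    return video, kind

# The pretext task then trains a video encoder plus a linear head to
# classify `kind` from the synthetic video.
clips = torch.randn(8, 3, 16, 32, 32)
labels = torch.randint(0, 4, (8,))
video, kind = synthesize_sample(clips, labels)
print(video.shape, kind)  # torch.Size([3, 32, 32, 32]) and a type in {0, 1, 2}
```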
Related papers
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework, TriDet, to resolve the imprecise action-boundary predictions of existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time implicitly assume that activity temporal boundaries are determined and precise.
In unscripted natural videos, different activities transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends.
We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z)
- Contrastive Language-Action Pre-training for Temporal Localization [64.34349213254312]
Long-form video understanding requires approaches that are able to temporally localize activities or language.
The lack of temporal annotations at this scale can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations.
We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions.
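As a rough illustration of this kind of objective, here is a generic InfoNCE-style contrastive loss with a validity mask over clips; the paper's actual masked contrastive formulation, encoders, and caption handling are not reproduced, and every name below is an assumption.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(video_emb, text_emb, valid, tau=0.07):
    """Symmetric InfoNCE between clip and caption embeddings, with a mask.

    video_emb, text_emb: (B, D) paired embeddings
    valid: (B,) bool; False marks e.g. background clips whose loss terms
    are masked out (a simplified stand-in for the paper's objective)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                       # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.T, targets, reduction="none")
    per_pair = 0.5 * (loss_v2t + loss_t2v)       # (B,)
    return (per_pair * valid).sum() / valid.sum().clamp(min=1)

# Toy usage: 4 clip/caption pairs, the last clip is unmatched background.
v, t = torch.randn(4, 128), torch.randn(4, 128)
mask = torch.tensor([True, True, True, False])
print(masked_contrastive_loss(v, t, mask))
```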
arXiv Detail & Related papers (2022-04-26T13:17:50Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
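A minimal sketch of this paste-and-align idea follows, under simplifying assumptions: a toy per-frame encoder, a single pseudo-action region, and plain cosine agreement standing in for the paper's exact loss. All names are illustrative.

```python
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Stand-in frame-level encoder: (B, C, T, H, W) -> (B, D, T)."""
    def __init__(self, c=3, d=64):
        super().__init__()
        self.proj = torch.nn.Conv1d(c, d, kernel_size=3, padding=1)

    def forward(self, x):
        return self.proj(x.mean(dim=(3, 4)))  # spatial average, then 1D conv

def paste_region(host, region, start):
    """Paste a pseudo-action region (C, t, H, W) into a host video
    (C, T, H, W) at temporal offset `start`."""
    out = host.clone()
    out[:, start:start + region.size(1)] = region
    return out

def pal_agreement(encoder, source, host1, host2, t_len=8):
    """Cut one pseudo-action region from `source`, paste it at two random
    positions in two host videos, and measure how well the encoder features
    of the two pasted spans agree (loss to be minimized)."""
    C, T, H, W = source.shape
    s = torch.randint(0, T - t_len + 1, ()).item()
    region = source[:, s:s + t_len]
    p1 = torch.randint(0, T - t_len + 1, ()).item()
    p2 = torch.randint(0, T - t_len + 1, ()).item()
    v1 = paste_region(host1, region, p1)
    v2 = paste_region(host2, region, p2)
    # Pool per-frame features over each known pasted span.
    f1 = encoder(v1.unsqueeze(0))[0][:, p1:p1 + t_len].mean(dim=1)
    f2 = encoder(v2.unsqueeze(0))[0][:, p2:p2 + t_len].mean(dim=1)
    return 1.0 - F.cosine_similarity(f1, f2, dim=0)

enc = ToyEncoder()
src, h1, h2 = (torch.randn(3, 32, 16, 16) for _ in range(3))
print(pal_agreement(enc, src, h1, h2))  # scalar agreement loss
```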
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution submitted to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing sets of the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection [32.159784061961886]
Temporal action detection (TAD) is a challenging task that aims to temporally localize and recognize human actions in untrimmed videos.
Current mainstream one-stage TAD approaches localize and classify action proposals by relying on pre-defined anchors.
A novel TAD model, termed Selective Receptive Field Network (SRF-Net), is developed.
arXiv Detail & Related papers (2021-06-29T11:29:16Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the subsequent stages.
Our models achieve state-of-the-art results on three datasets.
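The stage-wise refinement idea can be sketched as follows; this is a simplified stand-in (single dilated convolutions per block, illustrative sizes), not the actual MS-TCN++ architecture with its dual dilated layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedStage(nn.Module):
    """One temporal-convolution stage: a small stack of residual dilated
    1D convolutions producing per-frame class logits."""
    def __init__(self, in_dim, hidden, n_classes, layers=4):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, 1)
        self.blocks = nn.ModuleList(
            nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(layers)
        )
        self.out = nn.Conv1d(hidden, n_classes, 1)

    def forward(self, x):
        h = self.inp(x)
        for conv in self.blocks:
            h = h + F.relu(conv(h))     # residual dilated block
        return self.out(h)              # per-frame logits (B, K, T)

class MultiStageTCN(nn.Module):
    """Multi-stage refinement: stage 1 predicts from features, and each
    later stage refines the previous stage's softmaxed predictions."""
    def __init__(self, feat_dim, hidden, n_classes, n_stages=4):
        super().__init__()
        self.stage1 = DilatedStage(feat_dim, hidden, n_classes)
        self.refiners = nn.ModuleList(
            DilatedStage(n_classes, hidden, n_classes)
            for _ in range(n_stages - 1)
        )

    def forward(self, feats):           # feats: (B, feat_dim, T)
        outs = [self.stage1(feats)]
        for stage in self.refiners:
            outs.append(stage(F.softmax(outs[-1], dim=1)))
        return outs                     # one (B, K, T) prediction per stage

model = MultiStageTCN(feat_dim=2048, hidden=64, n_classes=10)
preds = model(torch.randn(2, 2048, 100))
print([p.shape for p in preds])  # four (2, 10, 100) tensors
```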
arXiv Detail & Related papers (2020-06-16T14:50:47Z)