Boundary-aware Self-supervised Learning for Video Scene Segmentation
- URL: http://arxiv.org/abs/2201.05277v1
- Date: Fri, 14 Jan 2022 02:14:07 GMT
- Title: Boundary-aware Self-supervised Learning for Video Scene Segmentation
- Authors: Jonghwan Mun, Minchul Shin, Gunsoo Han, Sangho Lee, Seongsu Ha,
Joonseok Lee, Eun-Sol Kim
- Abstract summary: Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
- Score: 20.713635723315527
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning has drawn attention through its effectiveness in
learning in-domain representations with no ground-truth annotations; in
particular, it is shown that properly designed pretext tasks (e.g., contrastive
prediction task) bring significant performance gains for downstream tasks
(e.g., classification task). Inspired by this, we tackle video scene
segmentation, which is a task of temporally localizing scene boundaries in a
video, with a self-supervised learning framework where we mainly focus on
designing effective pretext tasks. In our framework, we discover a
pseudo-boundary from a sequence of shots by splitting it into two continuous,
non-overlapping sub-sequences and leverage the pseudo-boundary to facilitate
the pre-training. Based on this, we introduce three novel boundary-aware
pretext tasks: 1) Shot-Scene Matching (SSM), 2) Contextual Group Matching (CGM)
and 3) Pseudo-boundary Prediction (PP); SSM and CGM guide the model to maximize
intra-scene similarity and inter-scene discrimination while PP encourages the
model to identify transitional moments. Through comprehensive analysis, we
empirically show that pre-training and transferring contextual representation
are both critical to improving the video scene segmentation performance.
Lastly, we achieve the new state-of-the-art on the MovieNet-SSeg benchmark. The
code is available at https://github.com/kakaobrain/bassl.
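To make the abstract's recipe concrete, below is a minimal, hypothetical Python/PyTorch sketch of how a pseudo-boundary could be drawn from a window of shot embeddings and how targets for the three pretext tasks could be derived from it. The function names (`make_pseudo_boundary`, `pretext_targets`), the choice of the boundary-adjacent shot as the CGM anchor, and the toy sampling are illustrative assumptions and do not mirror the released BaSSL code; the repository linked above contains the actual implementation.

```python
import torch
import torch.nn.functional as F

def make_pseudo_boundary(num_shots: int) -> int:
    """Split a window of shots into two contiguous, non-overlapping
    sub-sequences; the split index acts as a pseudo scene boundary.
    Assumes num_shots >= 2 so each side keeps at least one shot."""
    return int(torch.randint(1, num_shots, (1,)))

def pretext_targets(shot_feats: torch.Tensor, boundary: int):
    """Derive toy targets for the three boundary-aware pretext tasks.

    shot_feats: (N, D) per-shot embeddings from a shot encoder.
    Returns SSM logits/labels, a CGM (anchor, positive, negative)
    index triplet, and per-shot PP labels.
    """
    left, right = shot_feats[:boundary], shot_feats[boundary:]

    # SSM: each shot should be closer to the pooled representation of
    # its own pseudo-scene than to that of the other pseudo-scene.
    scene_left = F.normalize(left.mean(dim=0), dim=-1)
    scene_right = F.normalize(right.mean(dim=0), dim=-1)
    shots = F.normalize(shot_feats, dim=-1)
    ssm_logits = shots @ torch.stack([scene_left, scene_right]).T  # (N, 2)
    ssm_labels = (torch.arange(len(shot_feats)) >= boundary).long()

    # CGM: the boundary-adjacent shot should match a shot drawn from
    # its own pseudo-scene (positive) but not one from the other side
    # (negative). Toy sampling; BaSSL's actual CGM sampling differs.
    anchor = boundary - 1
    pos = int(torch.randint(0, boundary, (1,)))
    neg = int(torch.randint(boundary, len(shot_feats), (1,)))
    cgm_triplet = (anchor, pos, neg)

    # PP: mark the shot right before the pseudo-boundary as transitional.
    pp_labels = torch.zeros(len(shot_feats), dtype=torch.long)
    pp_labels[boundary - 1] = 1

    return ssm_logits, ssm_labels, cgm_triplet, pp_labels
```

During pre-training, these targets would typically be turned into cross-entropy (SSM, PP) and matching (CGM) losses and summed; the exact loss formulation and weighting used by the authors are those in the repository, not this sketch.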
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding (VPG) aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of the pasted pseudo-action regions from the two synthetic videos and maximize the agreement between them (a rough sketch follows this list).
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully supervised setting that requires a large number of manual annotations, or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Set-Constrained Viterbi for Set-Supervised Action Segmentation [40.22433538226469]
This paper is about weakly supervised action segmentation.
The ground truth specifies only a set of actions present in a training video, but not their true temporal ordering.
We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths.
arXiv Detail & Related papers (2020-02-27T05:32:52Z)
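The sketch referenced in the Unsupervised Pre-training for Temporal Action Localization entry above: a rough, hypothetical Python/PyTorch illustration of the cut-paste-and-align idea. The function names (`paste_pseudo_action`, `pal_agreement`) and the simple cosine agreement are assumptions made for illustration, not that paper's actual implementation or loss.

```python
import torch
import torch.nn.functional as F

def paste_pseudo_action(source: torch.Tensor, target: torch.Tensor,
                        region_len: int):
    """Cut a random temporal region (a run of clip features) from `source`
    and paste it at a random temporal position inside a copy of `target`.

    source, target: (T, D) sequences of clip-level features; assumes
    region_len <= len(source) and region_len <= len(target).
    Returns the synthetic sequence and the pasted region's location.
    """
    t_src = int(torch.randint(0, source.shape[0] - region_len + 1, (1,)))
    region = source[t_src:t_src + region_len]

    t_tgt = int(torch.randint(0, target.shape[0] - region_len + 1, (1,)))
    synthetic = target.clone()
    synthetic[t_tgt:t_tgt + region_len] = region
    return synthetic, slice(t_tgt, t_tgt + region_len)

def pal_agreement(feats_a: torch.Tensor, loc_a: slice,
                  feats_b: torch.Tensor, loc_b: slice) -> torch.Tensor:
    """Cosine agreement between the pasted pseudo-action regions of the
    two synthetic videos; its negative can serve as a simple alignment loss."""
    za = F.normalize(feats_a[loc_a].mean(dim=0), dim=-1)
    zb = F.normalize(feats_b[loc_b].mean(dim=0), dim=-1)
    return torch.dot(za, zb)
```

In practice, the two synthetic videos would be encoded by the feature encoder being pre-trained and the agreement maximized over many sampled regions.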