UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event
Boundary Detection
- URL: http://arxiv.org/abs/2111.14799v2
- Date: Tue, 30 Nov 2021 02:29:38 GMT
- Title: UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event
Boundary Detection
- Authors: Hyolim Kang, Jinwoo Kim, Taehyun Kim, Seon Joo Kim
- Abstract summary: Generic Event Boundary Detection (GEBD) aims to find one level deeper semantic boundaries of events.
We propose a novel framework for unsupervised/supervised GEBD, using the Temporal Self-similarity Matrix (TSM) as the video representation.
Our framework can be applied to both unsupervised and supervised settings, with both achieving state-of-the-art performance by a huge margin.
- Score: 27.29169136392871
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generic Event Boundary Detection (GEBD) is a newly suggested video
understanding task that aims to find one level deeper semantic boundaries of
events. Bridging the gap between natural human perception and video
understanding, it has various potential applications, including interpretable
and semantically valid video parsing. Still at an early development stage,
existing GEBD solvers are simple extensions of relevant video understanding
tasks, disregarding GEBD's distinctive characteristics. In this paper, we
propose a novel framework for unsupervised/supervised GEBD, by using the
Temporal Self-similarity Matrix (TSM) as the video representation. The new
Recursive TSM Parsing (RTP) algorithm exploits local diagonal patterns in TSM
to detect boundaries, and it is combined with the Boundary Contrastive (BoCo)
loss to train our encoder to generate more informative TSMs. Our framework can
be applied to both unsupervised and supervised settings, with both achieving
state-of-the-art performance by a huge margin on the GEBD benchmark. Notably,
our unsupervised method outperforms the previous state-of-the-art "supervised"
model, implying its exceptional efficacy.
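As a minimal sketch of the TSM representation mentioned in the abstract, the snippet below builds a temporal self-similarity matrix from per-frame feature vectors. It assumes cosine similarity and uses toy features in place of a learned encoder; the function name `temporal_self_similarity` and the synthetic data are hypothetical and are not taken from the paper, and the RTP algorithm and BoCo loss are not reproduced here.

```python
import numpy as np


def temporal_self_similarity(features: np.ndarray) -> np.ndarray:
    """Return the T x T cosine-similarity matrix for per-frame features of shape (T, D)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)  # guard against zero-length vectors
    return unit @ unit.T


# Toy "video" (an assumption for illustration): two events of 4 frames each,
# simulated as noisy copies of two orthogonal feature directions.
rng = np.random.default_rng(0)
event_a = np.tile([1.0, 0.0, 0.0], (4, 1))
event_b = np.tile([0.0, 1.0, 0.0], (4, 1))
frames = np.vstack([event_a, event_b]) + 0.01 * rng.standard_normal((8, 3))

tsm = temporal_self_similarity(frames)
print(np.round(tsm, 2))
```

In this toy case the matrix shows two bright blocks along the diagonal, one per event, with low similarity between frames of different events. The transition between blocks is the kind of local diagonal pattern a boundary detector could look for.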
Related papers
- Rethinking the Architecture Design for Efficient Generic Event Boundary Detection [71.50748944513379]
Generic event boundary detection (GEBD) is inspired by the human visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks.
SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios.
We experimentally reexamine the architecture of GEBD models and contribute to addressing this challenge.
arXiv Detail & Related papers (2024-07-17T14:49:54Z)
- Fine-grained Dynamic Network for Generic Event Boundary Detection [9.17191007695011]
We propose a novel dynamic pipeline for generic event boundaries named DyBDet.
By introducing a multi-exit network architecture, DyBDet automatically learns the allocation to different video snippets.
Experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks.
arXiv Detail & Related papers (2024-07-05T06:02:46Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Unified Domain Adaptive Semantic Segmentation [96.74199626935294]
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain.
We propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies.
Our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.
arXiv Detail & Related papers (2023-11-22T09:18:49Z)
- Motion Aware Self-Supervision for Generic Event Boundary Detection [14.637933739152315]
Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries.
Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices.
We revisit a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task.
arXiv Detail & Related papers (2022-10-11T16:09:13Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video- perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event.
We propose a dual-branch network which takes as input proposals with multi-granularities in both spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z)
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z)
- Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach [27.904987752334314]
We introduce a novel contrastive learning based approach to deal with the Generic Event Boundary Detection task.
In our model, Temporal Self-similarity Matrix (TSM) is utilized as an intermediate representation which takes on a role as an information bottleneck.
arXiv Detail & Related papers (2021-06-22T05:21:59Z)
- MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)