Motion Aware Self-Supervision for Generic Event Boundary Detection
- URL: http://arxiv.org/abs/2210.05574v2
- Date: Wed, 12 Oct 2022 09:59:27 GMT
- Title: Motion Aware Self-Supervision for Generic Event Boundary Detection
- Authors: Ayush K. Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan
F. Smeaton, Noel E. O'Connor
- Abstract summary: Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries.
Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices.
We revisit a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task.
- Score: 14.637933739152315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of Generic Event Boundary Detection (GEBD) aims to detect moments in
videos that are naturally perceived by humans as generic and taxonomy-free
event boundaries. Modeling the dynamically evolving temporal and spatial
changes in a video makes GEBD a difficult problem to solve. Existing approaches
involve very complex and sophisticated pipelines in terms of architectural
design choices, hence creating a need for more straightforward and simplified
approaches. In this work, we address this issue by revisiting a simple and
effective self-supervised method and augment it with a differentiable motion
feature learning module to tackle the spatial and temporal diversities in the
GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD
and TAPOS datasets to demonstrate the efficacy of the proposed approach
compared to the other self-supervised state-of-the-art methods. We also show
that this simple self-supervised approach learns motion features without any
explicit motion-specific pretext task.
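As a rough illustration of the boundary-detection setting (not the authors' actual method — the function names and the cosine-dissimilarity scoring here are assumptions for the sketch), an event-boundary signal can be derived from how much per-frame features change between adjacent frames:

```python
import numpy as np

def boundary_scores(feats):
    """Dissimilarity between consecutive frame embeddings.

    feats: (T, D) array of per-frame features, e.g. from a
    self-supervised backbone. Returns T-1 scores; a high score
    means the content changed sharply between two frames.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = np.sum(f[:-1] * f[1:], axis=1)   # cosine similarity per pair
    return 1.0 - cos

def detect_boundaries(scores, thresh=0.5):
    """Indices of local maxima above thresh (candidate boundaries)."""
    idx = []
    for t in range(1, len(scores) - 1):
        if scores[t] > thresh and scores[t] >= scores[t - 1] \
                and scores[t] >= scores[t + 1]:
            idx.append(t + 1)  # boundary lies between frames t and t+1
    return idx

# Toy example: two constant segments, with a switch at frame 5.
feats = np.concatenate([np.tile([1.0, 0.0], (5, 1)),
                        np.tile([0.0, 1.0], (5, 1))])
print(detect_boundaries(boundary_scores(feats)))  # prints [5]
```

In the paper's setting the features would come from the self-supervised backbone augmented with the motion feature learning module, rather than from raw frames.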
Related papers
- Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues.
MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders.
We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
- Object Style Diffusion for Generalized Object Detection in Urban Scene [69.04189353993907]
We introduce a novel single-domain object detection generalization method, named GoDiff.
By integrating pseudo-target domain data with source domain data, we diversify the training dataset.
Experimental results demonstrate that our method not only enhances the generalization ability of existing detectors but also functions as a plug-and-play enhancement for other single-domain generalization methods.
arXiv Detail & Related papers (2024-12-18T13:03:00Z)
- Fine-grained Dynamic Network for Generic Event Boundary Detection [9.17191007695011]
We propose a novel dynamic pipeline for generic event boundaries named DyBDet.
By introducing a multi-exit network architecture, DyBDet automatically learns the allocation to different video snippets.
Experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks.
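A multi-exit architecture of this kind can be sketched as an early-exit loop (a minimal illustration of the general idea; the function names, confidence rule, and threshold are assumptions, not DyBDet's actual design):

```python
def multi_exit_predict(snippet, stages, heads, conf_thresh=0.9):
    """Run one video snippet through successive network stages,
    exiting early once an intermediate head is confident.

    stages, heads: equal-length lists of callables. Each head maps
    the current representation to a boundary probability in [0, 1].
    Returns (probability, index_of_exit_used).
    """
    x = snippet
    p = 0.5
    for i, (stage, head) in enumerate(zip(stages, heads)):
        x = stage(x)
        p = head(x)
        if max(p, 1.0 - p) >= conf_thresh:  # confident enough: stop here
            return p, i
    return p, len(stages) - 1               # hard snippet: full depth used

# Easy snippet: the first head is already confident, so the later
# (more expensive) stages are skipped entirely.
stages = [lambda x: x, lambda x: x, lambda x: x]
easy_heads = [lambda x: 0.97, lambda x: 0.99, lambda x: 0.99]
print(multi_exit_predict([0.0], stages, easy_heads))  # exits at index 0
```

The payoff is that easy snippets consume only a fraction of the compute, which is the "allocation to different video snippets" the summary refers to.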
arXiv Detail & Related papers (2024-07-05T06:02:46Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task Livestream Highlight Detection, discuss and analyze the difficulties listed above and propose a novel architecture AntPivot to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Guidance and Teaching Network for Video Salient Object Detection [38.22880271210646]
We propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet).
GTNet distils effective spatial and temporal cues with implicit guidance and explicit teaching at feature- and decision-level.
This novel learning strategy achieves satisfactory results via decoupling the complex spatial-temporal cues and mapping informative cues across different modalities.
arXiv Detail & Related papers (2021-05-21T03:25:38Z)
- Sequential convolutional network for behavioral pattern extraction in gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
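Aggregating a sequence of uncertain length into a fixed-size representation can be sketched as a sliding temporal window followed by pooling (a crude stand-in for SCN's mobile 3D convolutional aggregator; the function name and pooling choice are assumptions for illustration):

```python
import numpy as np

def aggregate_sequence(frames, win=3):
    """Pool a variable-length sequence of frame features into one
    fixed-size vector.

    A window of `win` frames slides over the sequence (the "mobile"
    part), each window is averaged, and the windowed responses are
    pooled, so downstream layers never see the uncertain length T.

    frames: (T, D) array with any T >= win. Returns a (D,) vector.
    """
    T, _ = frames.shape
    windows = np.stack([frames[t:t + win].mean(axis=0)
                        for t in range(T - win + 1)])
    return windows.mean(axis=0)
```

A real implementation would replace the window average with a learned 3D convolution, but the length-independence of the output is the same.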
arXiv Detail & Related papers (2021-04-23T08:44:10Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube.
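The memory-based correspondence matching at the core of such trackers can be sketched as nearest-neighbour label propagation against stored features (a simplified illustration, not the paper's actual memory mechanism; the function name and cosine matching rule are assumptions):

```python
import numpy as np

def memory_match(query, memory_keys, memory_labels):
    """Propagate mask labels to current-frame features by
    nearest-neighbour matching against a feature memory.

    query: (N, D) features of the current frame.
    memory_keys: (M, D) features stored from past frames.
    memory_labels: (M,) object labels for the stored features.
    Returns an (N,) array of propagated labels.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sim = q @ k.T                               # (N, M) cosine similarity
    return memory_labels[np.argmax(sim, axis=1)]
```

Keeping features from many past frames in the memory (rather than only the previous frame) is what allows long-term matching to survive occlusions and the spatial-temporal discontinuities the abstract mentions.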
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.