Motion Aware Self-Supervision for Generic Event Boundary Detection
- URL: http://arxiv.org/abs/2210.05574v2
- Date: Wed, 12 Oct 2022 09:59:27 GMT
- Title: Motion Aware Self-Supervision for Generic Event Boundary Detection
- Authors: Ayush K. Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan
F. Smeaton, Noel E. O'Connor
- Abstract summary: Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries.
Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices.
We revisit a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task.
- Score: 14.637933739152315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of Generic Event Boundary Detection (GEBD) aims to detect moments in
videos that are naturally perceived by humans as generic and taxonomy-free
event boundaries. Modeling the dynamically evolving temporal and spatial
changes in a video makes GEBD a difficult problem to solve. Existing approaches
involve very complex and sophisticated pipelines in terms of architectural
design choices, hence creating a need for more straightforward and simplified
approaches. In this work, we address this issue by revisiting a simple and
effective self-supervised method and augmenting it with a differentiable motion
feature learning module to tackle the spatial and temporal diversities in the
GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD
and TAPOS datasets to demonstrate the efficacy of the proposed approach
compared to the other self-supervised state-of-the-art methods. We also show
that this simple self-supervised approach learns motion features without any
explicit motion-specific pretext task.
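As a toy illustration of the generic idea behind event boundary detection (not the method of this paper): score each pair of adjacent frames by the dissimilarity of their features, then mark local maxima above a threshold as candidate boundaries. The synthetic "frame features" and the threshold below are placeholder assumptions for the sketch only.

```python
import numpy as np

def boundary_scores(features: np.ndarray) -> np.ndarray:
    """Dissimilarity (1 - cosine similarity) between consecutive frame features."""
    a, b = features[:-1], features[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return 1.0 - cos

def detect_boundaries(scores: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Frame indices whose incoming score is a local maximum above the threshold."""
    idx = []
    for t in range(1, len(scores) - 1):
        if scores[t] > threshold and scores[t - 1] <= scores[t] >= scores[t + 1]:
            idx.append(t + 1)  # boundary lies between frame t and frame t+1
    return idx

# Synthetic clip: frames 0-9 share one appearance, frames 10-19 another.
rng = np.random.default_rng(0)
clip = np.concatenate([
    rng.normal(0.0, 0.05, (10, 16)) + np.eye(16)[0],
    rng.normal(0.0, 0.05, (10, 16)) + np.eye(16)[1],
])
scores = boundary_scores(clip)
print(detect_boundaries(scores))  # → [10]
```

Real GEBD systems replace the synthetic features with learned spatio-temporal representations; the scoring-and-peak-picking step is the only part this sketch is meant to convey.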
Related papers
- Fine-grained Dynamic Network for Generic Event Boundary Detection [9.17191007695011]
We propose a novel dynamic pipeline for generic event boundaries named DyBDet.
By introducing a multi-exit network architecture, DyBDet automatically learns how to allocate computation across different video snippets.
Experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks.
arXiv Detail & Related papers (2024-07-05T06:02:46Z)
- Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method [63.49140028965778]
We present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions.
To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion.
We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions.
arXiv Detail & Related papers (2024-03-24T14:24:13Z)
- FLD: Fourier Latent Dynamics for Structured Motion Representation and Learning [19.491968038335944]
We introduce a self-supervised, structured representation and generation method that extracts spatial-temporal relationships in periodic or quasi-periodic motions.
Our work opens new possibilities for future advancements in general motion representation and learning algorithms.
arXiv Detail & Related papers (2024-02-21T13:59:21Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, discuss and analyze its difficulties, and propose a novel architecture, AntPivot, to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Guidance and Teaching Network for Video Salient Object Detection [38.22880271210646]
We propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet).
GTNet distils effective spatial and temporal cues with implicit guidance and explicit teaching at feature- and decision-level.
This novel learning strategy achieves satisfactory results via decoupling the complex spatial-temporal cues and mapping informative cues across different modalities.
arXiv Detail & Related papers (2021-05-21T03:25:38Z)
- Sequential convolutional network for behavioral pattern extraction in gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
arXiv Detail & Related papers (2021-04-23T08:44:10Z)
- Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently moving objects in data acquired with an event-based camera.
The method performs on par or better than the state of the art without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.