CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation
- URL: http://arxiv.org/abs/2109.15170v1
- Date: Thu, 30 Sep 2021 14:40:32 GMT
- Title: CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation
- Authors: Xiao Wang, Jingen Liu, Tao Mei, Jiebo Luo
- Abstract summary: We propose an end-to-end self-supervised learning framework for event segmentation/boundary detection.
Our framework exploits a transformer-based feature reconstruction scheme to detect event boundaries via reconstruction errors.
The goal of our work is to segment generic events rather than localize some specific ones.
- Score: 118.18977078626776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognitive research has found that humans accomplish event
segmentation as a side effect of event anticipation. Inspired by this
discovery, we propose a simple yet effective end-to-end self-supervised
learning framework for event segmentation/boundary detection. Unlike the
mainstream clustering-based methods, our framework exploits a transformer-based
feature reconstruction scheme to detect event boundaries via reconstruction
errors. This is consistent with the fact that humans spot new events by
leveraging the deviation between their prediction and what is actually
perceived. Owing to their semantic heterogeneity, frames at boundaries are
difficult to reconstruct (generally incurring large reconstruction errors),
which is favorable for event boundary detection. Additionally, since the
reconstruction occurs on the semantic feature level instead of pixel level, we
develop a temporal contrastive feature embedding module to learn the semantic
visual representation for frame feature reconstruction. This procedure is like
humans building up experiences with "long-term memory". The goal of our work is
to segment generic events rather than localize some specific ones. We focus on
achieving accurate event boundaries. As a result, we adopt F1 score
(Precision/Recall) as our primary evaluation metric for a fair comparison with
previous approaches. Meanwhile, we also calculate the conventional frame-based
MoF and IoU metrics. We thoroughly benchmark our work on four publicly
available datasets and demonstrate consistently better results than previous
approaches.
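To make the mechanism concrete, below is a minimal sketch, assuming PyTorch, of the reconstruction-based detector: a small transformer reconstructs each frame's feature from its masked temporal context, and frames where the reconstruction error peaks are flagged as candidate boundaries. This is an illustration, not the authors' released implementation; the module sizes, window length, and the simple mean-plus-std peak picker are all assumptions, and the random input features stand in for the paper's learned temporal contrastive embeddings.

```python
# A minimal sketch (not the authors' code) of boundary detection by
# feature-reconstruction error. All sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameReconstructor(nn.Module):
    """Reconstructs the center frame's feature from a masked temporal window."""
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, window):                      # window: (B, T, dim)
        mid = window.shape[1] // 2
        masked = window.clone()
        masked[:, mid] = self.mask_token            # hide the center frame
        return self.encoder(masked)[:, mid]         # reconstructed center feature

def boundary_scores(features, model, half_window=4):
    """Per-frame reconstruction error; a high error suggests an event boundary."""
    T = features.shape[0]
    errors = torch.zeros(T)
    model.eval()
    for t in range(half_window, T - half_window):
        window = features[t - half_window : t + half_window + 1].unsqueeze(0)
        with torch.no_grad():
            recon = model(window).squeeze(0)
        errors[t] = F.mse_loss(recon, features[t])
    return errors

if __name__ == "__main__":
    # Stand-in frame features; in the paper these would come from the learned
    # temporal contrastive embedding module, not from random vectors.
    feats = torch.randn(64, 128)
    scores = boundary_scores(feats, FrameReconstructor())
    threshold = scores.mean() + scores.std()        # naive peak picking
    print("candidate boundaries:",
          torch.nonzero(scores > threshold).flatten().tolist())
```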
Related papers
- Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras [45.063747874243276]
We present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation.
The proposed framework performs asymmetric dual-student learning between 1) the original forward event data and 2) the longer reversed event data.
We show that the proposed method achieves strong segmentation results even without relying on pixel-level dense ground truth.
arXiv Detail & Related papers (2024-07-15T20:00:50Z)
- Visual Context-Aware Person Fall Detection [52.49277799455569]
Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false-positive alarms.
We present a segmentation pipeline to semi-automatically separate individuals from objects in images.
We demonstrate that object-specific contextual transformations during training effectively mitigate this challenge.
arXiv Detail & Related papers (2024-04-11T19:06:36Z)
- AttenScribble: Attentive Similarity Learning for Scribble-Supervised Medical Image Segmentation [5.8447004333496855]
In this paper, we present a straightforward yet effective scribble supervised learning framework.
We create a pluggable spatial self-attention module that can be attached on top of the internal feature layers of any fully convolutional network (FCN) backbone.
This attentive similarity leads to a novel regularization loss that imposes consistency between segmentation prediction and visual affinity.
arXiv Detail & Related papers (2023-12-11T18:42:18Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot Semantic Segmentation [7.707161030443157]
Generalized Few-shot Semantic Segmentation (GFSS) extends Few-shot Semantic Segmentation to segment both unseen and seen classes during evaluation.
We address the aforementioned problems by combining prototypical kernel learning with open-set foreground perception.
In addition, a foreground contextual perception module, cooperating with conditional-bias-based inference, performs class-agnostic as well as open-set foreground detection.
arXiv Detail & Related papers (2023-08-09T13:38:52Z)
- Structured Context Transformer for Generic Event Boundary Detection [32.09242716244653]
We present Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection task.
A backbone convolutional neural network (CNN) extracts the features of each video frame.
A lightweight fully convolutional network then determines the event boundaries from the grouped similarity maps (a minimal sketch of this similarity-map idea appears after this list).
arXiv Detail & Related papers (2022-06-07T03:00:24Z)
- Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Generic Event Boundary Detection: A Benchmark for Event Segmentation [21.914662894860474]
This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.
We introduce the task of Generic Event Boundary Detection (GEBD) and the new benchmark Kinetics-GEBD.
Inspired by the cognitive finding that humans mark boundaries at points where they are unable to predict the future accurately, we explore unsupervised approaches.
arXiv Detail & Related papers (2021-01-26T01:31:30Z)
- Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [53.850686395708905]
Event-based cameras record an asynchronous stream of per-pixel brightness changes.
In this paper, we focus on single-layer architectures for representation learning from event data.
We show improvements of up to 9% in recognition accuracy compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-09-23T10:40:03Z)
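As referenced in the Structured Context Transformer entry above, here is a minimal sketch, assuming PyTorch, of scoring boundaries from a frame-similarity map with a lightweight convolutional head. It illustrates the general idea only, not that paper's actual architecture: the banded similarity gathering, layer sizes, and class name are invented for brevity.

```python
# A minimal sketch of boundary scoring from a frame-similarity map
# (illustrative assumptions; not the SC-Transformer release).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityBoundaryHead(nn.Module):
    def __init__(self, window=16):
        super().__init__()
        self.window = window
        # Each frame's local band of similarities is treated as its feature
        # vector and scored by a small 1D convolutional head.
        self.head = nn.Sequential(
            nn.Conv1d(window, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):                      # feats: (T, C) per-frame CNN features
        feats = F.normalize(feats, dim=1)
        sim = feats @ feats.t()                    # (T, T) cosine-similarity map
        T = sim.shape[0]
        rows = []
        for t in range(T):                         # clamp a fixed-size band per frame
            lo = max(0, min(t - self.window // 2, T - self.window))
            rows.append(sim[t, lo : lo + self.window])
        band = torch.stack(rows)                   # (T, window)
        logits = self.head(band.t().unsqueeze(0))  # (1, window, T) -> (1, 1, T)
        return logits.flatten()                    # one boundary logit per frame

if __name__ == "__main__":
    frame_feats = torch.randn(64, 256)             # stand-in backbone CNN features
    print(SimilarityBoundaryHead()(frame_feats).shape)  # torch.Size([64])
```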