ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
- URL: http://arxiv.org/abs/2407.12987v1
- Date: Wed, 17 Jul 2024 20:07:05 GMT
- Title: ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
- Authors: Hyolim Kang, Jeongseok Hyun, Joungbin An, Youngjae Yu, Seon Joo Kim
- Abstract summary: ActionSwitch is the first class-agnostic On-TAL framework capable of detecting overlapping actions.
By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations.
- Score: 35.371453530275666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes -- a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. Our ActionSwitch achieves state-of-the-art performance in complex datasets, including Epic-Kitchens 100 targeting the challenging egocentric view and FineAction consisting of fine-grained actions.
Related papers
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO.
This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors.
Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by x2.7 and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by x2.6 on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
- 2by2: Weakly-Supervised Learning for Global Action Segmentation [4.880243880711163]
This paper presents a simple yet effective approach for the poorly investigated task of global action segmentation.
We propose to use activity labels to learn, in a weakly-supervised fashion, action representations suitable for global action segmentation.
For the backbone architecture, we use a Siamese network based on sparse transformers that takes video pairs as input and determines whether they belong to the same activity.
arXiv Detail & Related papers (2024-12-17T11:49:36Z)
- One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features [2.8266810371534152]
Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach.
The proposed method achieves superior results compared to the other methods in both Open-vocab and Closed-vocab settings.
arXiv Detail & Related papers (2024-04-30T13:14:28Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting [22.86745487695168]
We propose a baseline based on multi-instance and multi-label learning.
We propose a novel approach that uses sets of actions as representation instead of modeling individual action classes.
We evaluate the proposed approach on a challenging dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches.
arXiv Detail & Related papers (2021-01-21T11:59:47Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
- A Novel Online Action Detection Framework from Untrimmed Video Streams [19.895434487276578]
We propose a novel online action detection framework that considers actions as a set of temporally ordered subclasses.
We augment our data by varying the lengths of videos to allow the proposed method to learn about the high intra-class variation in human actions.
arXiv Detail & Related papers (2020-03-17T14:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.