Leveraging Self-Supervised Training for Unintentional Action Recognition
- URL: http://arxiv.org/abs/2209.11870v1
- Date: Fri, 23 Sep 2022 21:36:36 GMT
- Title: Leveraging Self-Supervised Training for Unintentional Action Recognition
- Authors: Enea Duka, Anna Kukleva, Bernt Schiele
- Abstract summary: We seek to identify the points in videos where the actions transition from intentional to unintentional.
We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions.
- Score: 82.19777933440143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unintentional actions are rare occurrences that are difficult to define
precisely and that are highly dependent on the temporal context of the action.
In this work, we explore such actions and seek to identify the points in videos
where the actions transition from intentional to unintentional. We propose a
multi-stage framework that exploits inherent biases such as motion speed,
motion direction, and order to recognize unintentional actions. To enhance
representations via self-supervised training for the task of unintentional
action recognition, we propose temporal transformations, called Temporal
Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The
multi-stage approach models the temporal information on both the level of
individual frames and full clips. These enhanced representations show strong
performance for unintentional action recognition tasks. We provide an extensive
ablation study of our framework and report results that significantly improve
over the state-of-the-art.
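The abstract names motion speed, motion direction, and temporal order as the inherent biases behind T2IBUA but does not spell out the transformations themselves. Below is a minimal sketch of what such a self-supervised pretext task could look like, assuming a clip is an array of frames; all function names and parameters are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def speed_up(clip: np.ndarray, factor: int = 2) -> np.ndarray:
    """Subsample frames to mimic faster motion (motion-speed bias)."""
    return clip[::factor]

def reverse(clip: np.ndarray) -> np.ndarray:
    """Play the clip backwards (motion-direction bias)."""
    return clip[::-1]

def shuffle_segments(clip: np.ndarray, n_segments: int = 4, rng=None) -> np.ndarray:
    """Permute equal-length segments of the clip (temporal-order bias)."""
    rng = rng or np.random.default_rng()
    segments = np.array_split(clip, n_segments)
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order])

# Pretext task: apply a random transformation and train the network to
# predict which one was applied, forcing it to model temporal structure.
TRANSFORMS = [lambda c: c, speed_up, reverse, shuffle_segments]

def make_pretext_sample(clip: np.ndarray, rng: np.random.Generator):
    label = int(rng.integers(len(TRANSFORMS)))
    return TRANSFORMS[label](clip), label
```

Labels produced this way could supervise a frame-level encoder first and a clip-level temporal model second, in the spirit of the multi-stage approach the abstract describes.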
Related papers
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- Action Sensitivity Learning for Temporal Action Localization [35.65086250175736]
We propose an Action Sensitivity Learning framework (ASL) to tackle the task of temporal action localization.
We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively.
Based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames.
arXiv Detail & Related papers (2023-05-25T04:19:14Z)
- DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue.
We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention.
We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z)
- Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- CLTA: Contents and Length-based Temporal Attention for Few-shot Action Recognition [2.0349696181833337]
We propose a Contents and Length-based Temporal Attention model, which learns customized temporal attention for the individual video.
We show that even a backbone without fine-tuning, combined with an ordinary softmax classifier, can still achieve results similar to or better than the state of the art in few-shot action recognition.
arXiv Detail & Related papers (2021-03-18T23:40:28Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset of sport videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
- Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach differs from existing works that either predict long-range complex activities as unordered sets of atomic actions or retrieve videos using natural language sentences. A toy sketch of scoring frame-wise predictions with such an automaton follows this list.
arXiv Detail & Related papers (2020-04-28T00:15:26Z)
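As referenced in the last entry above, here is a toy illustration of the probabilistic-automaton idea, applied to the intentional-to-unintentional transition the main paper studies: a two-state automaton for the pattern "intentional+ unintentional+" scored against frame-wise action probabilities. The states, symbols, and Viterbi-style scoring are illustrative assumptions, not the cited paper's formulation:

```python
import numpy as np

# Two-state automaton for the pattern "int+ unint+": state 0 has seen only
# intentional frames; state 1 is entered at the transition and absorbs the
# remaining unintentional frames. TRANSITIONS[state][symbol] -> next state.
TRANSITIONS = {0: {"int": 0, "unint": 1}, 1: {"unint": 1}}
ACCEPTING = 1

def best_parse_logprob(posteriors):
    """Viterbi-style max log-probability of an accepting parse, given a
    list of per-frame dicts mapping symbol -> probability."""
    score = {0: 0.0, 1: -np.inf}
    for frame in posteriors:
        new = {0: -np.inf, 1: -np.inf}
        for state, logp in score.items():
            for symbol, nxt in TRANSITIONS.get(state, {}).items():
                new[nxt] = max(new[nxt], logp + np.log(frame[symbol] + 1e-9))
        score = new
    return score[ACCEPTING]

# Example: per-frame probabilities drift from intentional to unintentional.
frames = [{"int": 0.9, "unint": 0.1},
          {"int": 0.5, "unint": 0.5},
          {"int": 0.1, "unint": 0.9}]
print(best_parse_logprob(frames))  # log-probability of the best parse
```

The best-scoring path implicitly picks the transition point, which is exactly the quantity the main paper seeks to localize.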