Related papers: Self-supervised Learning for Unintentional Action Prediction

URL: http://arxiv.org/abs/2209.12074v1
Date: Sat, 24 Sep 2022 19:06:46 GMT
Title: Self-supervised Learning for Unintentional Action Prediction
Authors: Olga Zatsarynna, Yazan Abu Farha, Juergen Gall
Abstract summary: We study the problem of self-supervised representation learning for unintentional action prediction. We show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization, and anticipation. In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.
Score: 23.1028903711402
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distinguishing if an action is performed as intended or if an intended action fails is an important skill that not only humans have, but that is also important for intelligent systems that operate in human environments. Recognizing if an action is unintentional or anticipating if an action will fail, however, is not straightforward due to lack of annotated data. While videos of unintentional or failed actions can be found in the Internet in abundance, high annotation costs are a major bottleneck for learning networks for these tasks. In this work, we thus study the problem of self-supervised representation learning for unintentional action prediction. While previous works learn the representation based on a local temporal neighborhood, we show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization and anticipation. In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.

Related papers

What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture. We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
Leveraging Self-Supervised Training for Unintentional Action Recognition [82.19777933440143]
We seek to identify the points in videos where the actions transition from intentional to unintentional. We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions.
arXiv Detail & Related papers (2022-09-23T21:36:36Z)
Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z)
Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream. The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations. Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks. We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning. Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. Our framework can be trained without the help of any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent [3.0079490585515343]
We derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics. We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm. Experiments on these datasets show that our algorithm can recognize whether an action is intentional or not, even without training data.
arXiv Detail & Related papers (2020-11-12T05:57:09Z)
Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos. We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
Learning Goals from Failure [30.071336708348472]
We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
arXiv Detail & Related papers (2020-06-28T17:16:49Z)
Evolving Losses for Unsupervised Video Representation Learning [91.2683362199263]
We present a new method to learn video representations from large-scale unlabeled video data. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods.
arXiv Detail & Related papers (2020-02-26T16:56:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.