Self-supervised Learning for Unintentional Action Prediction
- URL: http://arxiv.org/abs/2209.12074v1
- Date: Sat, 24 Sep 2022 19:06:46 GMT
- Title: Self-supervised Learning for Unintentional Action Prediction
- Authors: Olga Zatsarynna, Yazan Abu Farha, Juergen Gall
- Abstract summary: We study the problem of self-supervised representation learning for unintentional action prediction.
We show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization, and anticipation.
In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.
- Score: 23.1028903711402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distinguishing if an action is performed as intended or if an intended action
fails is an important skill that not only humans have, but that is also
important for intelligent systems that operate in human environments.
Recognizing if an action is unintentional or anticipating if an action will
fail, however, is not straightforward due to lack of annotated data. While
videos of unintentional or failed actions can be found on the Internet in
abundance, high annotation costs are a major bottleneck for learning networks
for these tasks. In this work, we thus study the problem of self-supervised
representation learning for unintentional action prediction. While previous
works learn the representation based on a local temporal neighborhood, we show
that the global context of a video is needed to learn a good representation for
the three downstream tasks: unintentional action classification, localization
and anticipation. In the supplementary material, we show that the learned
representation can be used for detecting anomalies in videos as well.
Related papers
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Leveraging Self-Supervised Training for Unintentional Action Recognition [82.19777933440143]
We seek to identify the points in videos where the actions transition from intentional to unintentional.
We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions.
arXiv Detail & Related papers (2022-09-23T21:36:36Z)
- Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z)
- Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Adding Knowledge to Unsupervised Algorithms for the Recognition of Intent [3.0079490585515343]
We derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics.
We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm.
Experiments on these datasets show that our algorithm can recognize whether an action is intentional or not, even without training data.
arXiv Detail & Related papers (2020-11-12T05:57:09Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Learning Goals from Failure [30.071336708348472]
We introduce a framework that predicts the goals behind observable human action in video.
Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
arXiv Detail & Related papers (2020-06-28T17:16:49Z)
- Evolving Losses for Unsupervised Video Representation Learning [91.2683362199263]
We present a new method to learn video representations from large-scale unlabeled video data.
The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods.
arXiv Detail & Related papers (2020-02-26T16:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.