Tragedy Plus Time: Capturing Unintended Human Activities from
Weakly-labeled Videos
- URL: http://arxiv.org/abs/2204.13548v1
- Date: Thu, 28 Apr 2022 14:56:43 GMT
- Title: Tragedy Plus Time: Capturing Unintended Human Activities from
Weakly-labeled Videos
- Authors: Arnav Chakravarthy, Zhiyuan Fang, Yezhou Yang
- Abstract summary: W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
- Score: 31.1632730473261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In videos that contain actions performed unintentionally, agents do not
achieve their desired goals. In such videos, it is challenging for computer
vision systems to understand high-level concepts such as goal-directed
behavior, an ability present in humans from a very early age. Inculcating this
ability in artificially intelligent agents would make them better social
learners by allowing them to evaluate human action under a teleological lens.
To validate the ability of deep learning models to perform this task, we curate
the W-Oops dataset, built upon the Oops dataset [15]. W-Oops consists of 2,100
unintentional human action videos, with 44 goal-directed and 30 unintentional
video-level activity labels collected through human annotations. Due to the
expensive segment annotation procedure, we propose a weakly supervised
algorithm for localizing the goal-directed as well as unintentional temporal
regions in the video, leveraging solely video-level labels. In particular, we
employ an attention mechanism-based strategy that predicts the temporal regions
which contribute the most to a classification task. Meanwhile, our overlap
regularization is designed to make the model focus on distinct portions of the
video when inferring the goal-directed and unintentional activities, while
guaranteeing their temporal ordering. Extensive quantitative experiments verify
the validity of our localization method. We further conduct a video captioning
experiment which demonstrates that the proposed localization module does indeed
assist teleological action understanding.
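As an illustration of the weakly supervised localization strategy described above, the following PyTorch sketch uses two attention branches, one for the goal-directed and one for the unintentional activity, to pool snippet features for video-level classification, plus a simple overlap penalty that pushes the two attention maps onto distinct temporal regions. Class names, the feature dimension, and the exact form of the penalty are illustrative assumptions; this is not the authors' released implementation.

```python
# Illustrative sketch only; names, dimensions, and loss terms are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class AttentionLocalizer(nn.Module):
    def __init__(self, feat_dim=1024, n_goal=44, n_unint=30):
        super().__init__()
        # one attention branch per activity type (goal-directed / unintentional)
        self.att_goal = nn.Linear(feat_dim, 1)
        self.att_unint = nn.Linear(feat_dim, 1)
        self.cls_goal = nn.Linear(feat_dim, n_goal)
        self.cls_unint = nn.Linear(feat_dim, n_unint)

    def forward(self, feats):
        # feats: (B, T, D) pre-extracted snippet features
        a_g = torch.softmax(self.att_goal(feats).squeeze(-1), dim=1)   # (B, T)
        a_u = torch.softmax(self.att_unint(feats).squeeze(-1), dim=1)  # (B, T)
        # attention-weighted pooling -> video-level class logits
        pooled_g = torch.einsum('bt,btd->bd', a_g, feats)
        pooled_u = torch.einsum('bt,btd->bd', a_u, feats)
        return self.cls_goal(pooled_g), self.cls_unint(pooled_u), a_g, a_u

def overlap_penalty(a_g, a_u):
    # Penalize temporal overlap so the two attention maps highlight
    # distinct portions of the video.
    return (a_g * a_u).sum(dim=1).mean()
```

Training would combine the two video-level classification losses with this penalty; the paper's additional constraint that the goal-directed region precedes the unintentional one is not modeled in this sketch.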
Related papers
- Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention [10.149523817328921]
We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input.
Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention.
Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition.
arXiv Detail & Related papers (2024-04-10T21:03:23Z)
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo-label generation is a promising strategy for this challenging problem, but current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- Leveraging Self-Supervised Training for Unintentional Action Recognition [82.19777933440143]
We seek to identify the points in videos where the actions transition from intentional to unintentional.
We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions.
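As a rough illustration of exploiting such inherent biases, the sketch below builds hypothetical pretext labels from playback speed and frame order; the sampling rates and label scheme are assumptions for illustration, not that paper's training recipe.

```python
# Hypothetical pretext-label construction from playback speed and frame order;
# not the recipe of the cited paper.
import torch

def speed_clips(frames, rates=(1, 2, 4)):
    """Subsample a (T, C, H, W) clip at several playback rates.

    Returns one clip per rate together with its speed-class label.
    """
    clips = [frames[::rate] for rate in rates]
    labels = torch.arange(len(rates))
    return clips, labels

def order_clip(frames):
    """Randomly reverse the frame order and return the binary order label."""
    is_reversed = torch.rand(()) < 0.5
    clip = torch.flip(frames, dims=[0]) if is_reversed else frames
    return clip, is_reversed.long()
```

A backbone pre-trained on such auxiliary signals could then be fine-tuned to locate the transition from intentional to unintentional action.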
arXiv Detail & Related papers (2022-09-23T21:36:36Z)
- Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
In this weakly supervised setting, a system does not know what human-object interactions are present in a video, nor the actual locations of the human and object.
We introduce a dataset comprising over 6.5k videos with human-object interactions that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
- Learning Goals from Failure [30.071336708348472]
We introduce a framework that predicts the goals behind observable human action in video.
Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
arXiv Detail & Related papers (2020-06-28T17:16:49Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)