Learning Actor-centered Representations for Action Localization in
Streaming Videos using Predictive Learning
- URL: http://arxiv.org/abs/2104.14131v1
- Date: Thu, 29 Apr 2021 06:06:58 GMT
- Title: Learning Actor-centered Representations for Action Localization in
Streaming Videos using Predictive Learning
- Authors: Sathyanarayanan N. Aakur, Sudeep Sarkar
- Abstract summary: Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
- Score: 18.757368441841123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event perception tasks such as recognizing and localizing actions in
streaming videos are essential for tackling visual understanding tasks.
Progress has primarily been driven by the use of large-scale, annotated
training data in a supervised manner. In this work, we tackle the problem of
learning \textit{actor-centered} representations through the notion of
continual hierarchical predictive learning to localize actions in streaming
videos without any training annotations. Inspired by cognitive theories of
event perception, we propose a novel, self-supervised framework driven by the
notion of hierarchical predictive learning to construct actor-centered features
by attention-based contextualization. Extensive experiments on three benchmark
datasets show that the approach can learn robust representations for localizing
actions using only one epoch of training, i.e., we train the model continually
in streaming fashion - one frame at a time, with a single pass through training
videos. We show that the proposed approach outperforms unsupervised and weakly
supervised baselines while offering competitive performance to fully supervised
approaches. Finally, we show that the proposed model generalizes to
out-of-domain data, without significant loss in performance and without any
finetuning, for both the recognition and localization tasks.
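
As a rough illustration of the training regime described in the abstract, the sketch below runs a single streaming pass in PyTorch: a small convolutional encoder produces a spatial feature map, softmax attention over locations pools it into an actor-centered feature, and an LSTM cell predicts the next frame's pooled feature, with the prediction error as the only loss. The encoder, feature sizes, attention form, and one-step truncation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: module sizes, the attention form, and the
# truncation scheme are assumptions, not the paper's exact architecture.
class PredictiveModel(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Frame encoder producing a spatial feature map (B, C, H, W).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.attn = nn.Conv2d(feat_dim, 1, 1)       # spatial attention logits
        self.rnn = nn.LSTMCell(feat_dim, feat_dim)  # predictor over time
        self.head = nn.Linear(feat_dim, feat_dim)   # next-frame feature prediction

    def encode(self, frame):
        fmap = self.encoder(frame)                          # (B, C, H, W)
        w = torch.softmax(self.attn(fmap).flatten(2), -1)   # (B, 1, H*W)
        return (fmap.flatten(2) * w).sum(-1), w             # actor-centered feature

    def forward(self, frame, state=None):
        feat, w = self.encode(frame)
        h, c = self.rnn(feat, state)
        return self.head(h), w, (h, c)

model = PredictiveModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
frames = torch.randn(16, 3, 64, 64)          # stand-in for a streaming video
state = None
for t in range(len(frames) - 1):             # single pass, one frame at a time
    pred_next, w, state = model(frames[t:t + 1], state)
    with torch.no_grad():                    # next frame's observed feature is the target
        target, _ = model.encode(frames[t + 1:t + 2])
    loss = F.mse_loss(pred_next, target)     # prediction error drives learning
    opt.zero_grad(); loss.backward(); opt.step()
    state = (state[0].detach(), state[1].detach())  # truncate BPTT across the stream
```

In this toy setup the attention weights `w` can be reshaped back to the spatial grid and thresholded to obtain a rough localization mask, which is, conceptually, the role attention-based contextualization plays for localization.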
Related papers
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
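
For readers unfamiliar with the second objective, the fragment below is a generic PyTorch sketch of inverse dynamics prediction rather than ALP's actual architecture: an encoder embeds two consecutive observations and a small head classifies which action produced the transition, forcing the representation to retain action-relevant information. All sizes and modules are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_actions, feat_dim = 6, 128
# Toy observation encoder; ALP uses a much richer visual backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
inv_head = nn.Linear(2 * feat_dim, num_actions)
opt = torch.optim.Adam(list(encoder.parameters()) + list(inv_head.parameters()), lr=1e-4)

obs_t = torch.randn(8, 3, 64, 64)               # observations at time t
obs_t1 = torch.randn(8, 3, 64, 64)              # observations at time t+1
actions = torch.randint(0, num_actions, (8,))   # actions actually taken in between

z = torch.cat([encoder(obs_t), encoder(obs_t1)], dim=-1)
loss = F.cross_entropy(inv_head(z), actions)    # which action caused this transition?
opt.zero_grad(); loss.backward(); opt.step()
```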
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
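
Below is a minimal PyTorch sketch of the action-free generative pre-training idea, assuming a simple recurrent next-frame predictor; the paper's model and losses are considerably richer, so treat this purely as an illustration of learning dynamics representations from videos without action labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder next-frame prediction model; the paper's architecture is richer.
class VideoPredictor(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(), nn.Flatten(),
                                 nn.Linear(64 * 16 * 16, latent))
        self.rnn = nn.GRUCell(latent, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64 * 16 * 16), nn.ReLU(),
                                 nn.Unflatten(1, (64, 16, 16)),
                                 nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 3, 4, 2, 1))

    def forward(self, video):                        # video: (T, B, 3, 64, 64)
        h, loss = None, 0.0
        for t in range(video.shape[0] - 1):
            h = self.rnn(self.enc(video[t]), h)      # no actions are used anywhere
            loss = loss + F.mse_loss(self.dec(h), video[t + 1])
        return loss / (video.shape[0] - 1)

model = VideoPredictor()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss = model(torch.randn(8, 2, 3, 64, 64))           # random stand-in clip (T=8, B=2)
opt.zero_grad(); loss.backward(); opt.step()
```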
arXiv Detail & Related papers (2022-03-25T19:44:09Z) - Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations
in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
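
The fragment below sketches a single cross-modal attention layer plus a batch-wise contrastive loss over video-narration pairs; the feature dimensions, pooling, and temperature are illustrative assumptions, not the paper's multilayer network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
opt = torch.optim.Adam(attn.parameters(), lr=1e-4)

B = 16
video_feats = torch.randn(B, 49, d)             # e.g. 7x7 spatial regions per clip
text_feats = torch.randn(B, 12, d)              # narration token embeddings

# Words attend over spatial regions; the weights provide the spatial grounding.
attended, weights = attn(query=text_feats, key=video_feats, value=video_feats)
v = F.normalize(attended.mean(1), dim=-1)       # pooled, word-grounded video feature
t = F.normalize(text_feats.mean(1), dim=-1)     # pooled narration feature

logits = v @ t.T / 0.07                         # clip-to-narration similarity matrix
labels = torch.arange(B)                        # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
opt.zero_grad(); loss.backward(); opt.step()
```

In a setup like this, the attention weights over spatial regions are what would localize each narrated interaction without region-level labels.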
arXiv Detail & Related papers (2021-10-20T14:45:13Z) - Action Shuffling for Weakly Supervised Temporal Localization [22.43209053892713]
This paper analyzes the order-sensitive and location-insensitive properties of actions.
It builds these properties into a self-augmented learning framework to improve weakly supervised action localization performance.
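
As a toy illustration of how those two properties can become self-supervision, the sketch below relocates an intact action span to synthesize an augmented video (location insensitivity) and penalizes an order-aware scorer whenever a shuffled span does not score below the intact one (order sensitivity); the scorer, margin, and augmentation rule are placeholders, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Order-aware span scorer; placeholder for the paper's localization network.
class SpanScorer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.rnn = nn.GRU(d, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, span):                          # span: (T, d) snippet features
        _, h = self.rnn(span.unsqueeze(0))
        return self.head(h[-1]).squeeze()             # scalar action score

scorer = SpanScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

video = torch.randn(100, 128)                         # snippet features of one video
start, end = 40, 60                                   # a candidate action span
span = video[start:end]

# Location insensitivity: relocate the intact span to synthesize a new video
# that keeps the same video-level label (shown only as an augmentation example).
background = torch.cat([video[:start], video[end:]])
augmented = torch.cat([background[:10], span, background[10:]])

# Order sensitivity: the intact span should outscore its shuffled counterpart.
shuffled = span[torch.randperm(len(span))]
loss = F.relu(1.0 + scorer(shuffled) - scorer(span))
opt.zero_grad(); loss.backward(); opt.step()
```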
arXiv Detail & Related papers (2021-05-10T09:05:58Z) - Teaching with Commentaries [108.62722733649542]
We propose a flexible teaching framework using commentaries and learned meta-information.
We find that commentaries can improve training speed and/or performance.
Commentaries can be reused when training new models to obtain performance benefits.
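
One concrete flavour of a commentary is a network that emits per-example loss weights and is trained by backpropagating a validation loss through a single differentiable student update; the sketch below shows that simplified setup in PyTorch and is not the paper's full meta-learning procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Commentary here: a network emitting per-example loss weights (one simplified
# flavour of the paper's learned meta-information).
commentary = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))
student = nn.Linear(784, 10)
c_opt = torch.optim.Adam(commentary.parameters(), lr=1e-3)
s_opt = torch.optim.SGD(student.parameters(), lr=0.1)

x_tr, y_tr = torch.randn(32, 784), torch.randint(0, 10, (32,))
x_val, y_val = torch.randn(32, 784), torch.randint(0, 10, (32,))

# Inner step: weighted training loss, kept differentiable w.r.t. the commentary.
w = torch.sigmoid(commentary(x_tr)).squeeze(-1)
inner = (w * F.cross_entropy(student(x_tr), y_tr, reduction="none")).mean()
gW, gb = torch.autograd.grad(inner, (student.weight, student.bias), create_graph=True)
W1, b1 = student.weight - 0.1 * gW, student.bias - 0.1 * gb   # one virtual SGD step

# Outer step: validation loss of the virtually-updated student trains the commentary.
val_loss = F.cross_entropy(x_val @ W1.t() + b1, y_val)
c_opt.zero_grad(); val_loss.backward(); c_opt.step()

# The student is then trained normally, using the (now fixed) commentary weights.
w_fixed = torch.sigmoid(commentary(x_tr)).detach().squeeze(-1)
student_loss = (w_fixed * F.cross_entropy(student(x_tr), y_tr, reduction="none")).mean()
s_opt.zero_grad(); student_loss.backward(); s_opt.step()
```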
arXiv Detail & Related papers (2020-11-05T18:52:46Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
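
For context, the snippet below sketches the instance-recognition objective adapted to clips: every training clip is treated as its own class and its embedding must match its own memory-bank entry rather than any other clip's; the toy encoder and bank update are assumptions, not the paper's training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_clips, dim, tau = 1000, 128, 0.07
encoder = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 32 * 32, dim))  # toy clip encoder
memory = F.normalize(torch.randn(num_clips, dim), dim=-1)               # one slot per clip
opt = torch.optim.SGD(encoder.parameters(), lr=0.03)

clips = torch.randn(16, 8, 3, 32, 32)                 # batch of 8-frame clips
idx = torch.randint(0, num_clips, (16,))              # each clip's unique instance id

z = F.normalize(encoder(clips), dim=-1)
logits = z @ memory.t() / tau                         # similarity to every stored instance
loss = F.cross_entropy(logits, idx)                   # recognize your own instance
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                  # slowly refresh the memory bank
    memory[idx] = F.normalize(0.5 * memory[idx] + 0.5 * z, dim=-1)
```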
arXiv Detail & Related papers (2020-06-28T22:23:03Z) - Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video, and use this model to predict high-level features for future frames.
This self-supervised framework is less complicated than other approaches, yet it is very effective at learning robust visual representations for both labeling and localization.
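
Complementing the predictive training itself, the short sketch below shows one plausible way a spatial prediction-error map could be turned into a localization box by thresholding the per-location error; the threshold and box rule are illustrative, not the paper's exact procedure.

```python
import torch

# pred_fmap / obs_fmap stand in for predicted vs. observed high-level feature
# maps of the current frame; in practice they come from the predictive model.
pred_fmap = torch.randn(64, 14, 14)
obs_fmap = torch.randn(64, 14, 14)

error = (pred_fmap - obs_fmap).pow(2).mean(0)          # (14, 14) per-location error
mask = error > error.mean() + error.std()              # keep unusually surprising cells
ys, xs = torch.nonzero(mask, as_tuple=True)
if len(ys) > 0:
    box = (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
    print("action box in feature-map coordinates:", box)
```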
arXiv Detail & Related papers (2020-03-26T23:32:43Z) - Weakly-Supervised Multi-Level Attentional Reconstruction Network for
Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage.
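
The fragment below sketches the reconstruction-as-supervision idea with only video-sentence pairs: candidate segments compete through attention, the attended video feature must reconstruct the query words (here as a simple bag of words), and the most attended segment is returned as the grounding at test time; shapes and modules are placeholders, not MARN's multi-level architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab = 128, 5000
seg_enc = nn.Linear(d, d)                     # candidate-segment encoder
query_enc = nn.Linear(d, d)                   # pooled-sentence encoder
recon = nn.Linear(d, vocab)                   # reconstructs query words from video
params = list(seg_enc.parameters()) + list(query_enc.parameters()) + list(recon.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

segments = torch.randn(1, 20, d)              # features of 20 candidate segments
words = torch.randint(0, vocab, (1, 8))       # token ids of the paired sentence
word_emb = torch.randn(1, 8, d)               # their embeddings (stand-in)

q = query_enc(word_emb.mean(1))                          # (1, d) sentence feature
scores = torch.softmax((seg_enc(segments) @ q.unsqueeze(-1)).squeeze(-1), dim=-1)
attended = (scores.unsqueeze(-1) * segments).sum(1)      # (1, d) attended video feature

# Bag-of-words reconstruction of the query from the attended video feature.
logits = recon(attended).unsqueeze(1).expand(-1, words.shape[1], -1)
loss = F.cross_entropy(logits.reshape(-1, vocab), words.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()

best_segment = scores.argmax(-1)              # inference: most attended segment
```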
arXiv Detail & Related papers (2020-03-16T07:01:01Z)