Action Localization through Continual Predictive Learning
- URL: http://arxiv.org/abs/2003.12185v1
- Date: Thu, 26 Mar 2020 23:32:43 GMT
- Title: Action Localization through Continual Predictive Learning
- Authors: Sathyanarayanan N. Aakur, Sudeep Sarkar
- Abstract summary: We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames.
This self-supervised framework is less complicated than other approaches but is very effective in learning robust visual representations for both labeling and localization.
- Score: 14.582013761620738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of action recognition involves locating the action in the video,
both over time and spatially in the image. The dominant current approaches use
supervised learning to solve this problem, and require large amounts of
annotated training data, in the form of frame-level bounding box annotations
around the region of interest. In this paper, we present a new approach based
on continual learning that uses feature-level predictions for self-supervision.
It does not require any training annotations in terms of frame-level bounding
boxes. The approach is inspired by cognitive models of visual event perception
that propose a prediction-based approach to event understanding. We use a stack
of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to
model the events in the video and use this model to predict high-level features
for the future frames. The prediction errors are used to continuously learn the
parameters of the models. This self-supervised framework is less complicated
than other approaches but is very effective in learning robust visual
representations for both labeling and localization. It should be noted that the
approach operates in a streaming fashion, requiring only a single pass through
the video, making it amenable to real-time processing. We demonstrate this on
three datasets (UCF Sports, JHMDB, and THUMOS'13) and show that the proposed
approach outperforms weakly-supervised and unsupervised baselines and obtains
competitive performance compared to fully supervised baselines. Finally, we
show that the proposed framework can generalize to egocentric videos and obtain
state-of-the-art results in unsupervised gaze prediction.
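As a concrete illustration of the loop the abstract describes, here is a minimal, hypothetical Python (PyTorch) sketch, not the authors' implementation: a stand-in CNN encoder yields per-frame feature maps, a stacked LSTM predicts the next frame's pooled features, and the prediction error serves both as the self-supervised loss for continual parameter updates and, per spatial location, as a rough localization cue. All names, dimensions, and the choice of mean-squared error are assumptions; the paper's attention mechanisms are omitted.

```python
# Minimal, hypothetical sketch of continual predictive learning for action
# localization (NOT the authors' code). Names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveModel(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, num_layers=2):
        super().__init__()
        # Stack of LSTMs that models the ongoing event from pooled frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        # Head that predicts the next frame's high-level (pooled) feature vector.
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feat_map, state=None):
        pooled = feat_map.mean(dim=(2, 3))               # (B, C) from (B, C, H, W)
        out, state = self.lstm(pooled.unsqueeze(1), state)
        return self.head(out.squeeze(1)), state          # predicted feature for t+1

def localization_map(pred, feat_map):
    # Locations whose observed features deviate most from the prediction are
    # taken as evidence for the actor/action (a simple error-based cue).
    err = (feat_map - pred[:, :, None, None]) ** 2       # (B, C, H, W)
    return err.mean(dim=1)                               # (B, H, W)

encoder = nn.Conv2d(3, 512, kernel_size=7, stride=32)    # stand-in for a CNN backbone
model = PredictiveModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

video = torch.randn(16, 1, 3, 224, 224)                  # fake 16-frame clip, batch of 1
state, prev_pred = None, None
for frame in video:                                       # streaming: a single pass
    feat = encoder(frame).detach()                        # (1, 512, H', W')
    if prev_pred is not None:
        loss = F.mse_loss(prev_pred, feat.mean(dim=(2, 3)))   # prediction error
        loc = localization_map(prev_pred.detach(), feat)      # (1, H', W') error map
        opt.zero_grad(); loss.backward(); opt.step()          # continual update
    prev_pred, state = model(feat, state)
    state = tuple(s.detach() for s in state)              # truncate BPTT between frames
```

In this sketch the update runs once per incoming frame, so the model adapts online with a single pass over the video, mirroring the streaming, real-time claim in the abstract.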
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes [19.987151025364067]
This paper presents a new semi-supervised method for training a reliable crowd counting model.
We foster the model's intrinsic 'subitizing' capability, which allows it to accurately estimate the count in regions.
Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks.
arXiv Detail & Related papers (2023-10-16T12:42:43Z)
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)