Early Action Recognition with Action Prototypes
- URL: http://arxiv.org/abs/2312.06598v1
- Date: Mon, 11 Dec 2023 18:31:13 GMT
- Title: Early Action Recognition with Action Prototypes
- Authors: Guglielmo Camporese, Alessandro Bergamo, Xunyu Lin, Joseph Tighe,
Davide Modolo
- Abstract summary: We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all the clips in an online fashion for the final class prediction.
- Score: 62.826125870298306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early action recognition is an important and challenging problem that requires
recognizing an action from a partially observed video stream in which the
activity is potentially unfinished or has not even started. In this work, we
propose a novel model that learns a prototypical representation of the full
action for each class and uses it to regularize the architecture and the visual
representations of the partial observations. Our model is very simple in design
and also efficient. We decompose the video into short clips, and a visual
encoder extracts features from each clip independently. A decoder then
aggregates the features from all the clips in an online fashion for the final
class prediction. During training, for each partial observation, the model is
jointly trained to predict both the label and the prototypical action
representation, which acts as a regularizer. We evaluate our method
on multiple challenging real-world datasets and outperform the current
state-of-the-art by a significant margin. For example, on early recognition
observing only the first 10% of each video, our method improves the SOTA by
+2.23 Top-1 accuracy on Something-Something-v2, +3.55 on UCF-101, +3.68 on
SSsub21, and +5.03 on EPIC-Kitchens-55, where prior work used either
multi-modal inputs (e.g. optical flow) or batched inference. Finally, we present
exhaustive ablation studies that motivate our design choices and offer insights
into what our model learns semantically.
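
A minimal PyTorch-style sketch of the setup described in the abstract is given below: clip-wise encoding, online (causal) aggregation, and joint prediction of the class label and a per-class full-action prototype that acts as a regularizer. The concrete choices here (a GRU decoder, an MSE prototype loss, all module names and dimensions) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyActionModel(nn.Module):
    """Toy sketch: per-clip encoder + online (causal) decoder that jointly
    predicts the class and a prototypical full-action embedding."""

    def __init__(self, num_classes, feat_dim=512, proto_dim=256):
        super().__init__()
        # Stand-in clip encoder; a real system would use a video backbone.
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(feat_dim), nn.ReLU())
        # Online aggregation: a recurrent decoder only sees clips up to step t.
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.proto_head = nn.Linear(feat_dim, proto_dim)
        # One learnable full-action prototype per class (assumed form).
        self.prototypes = nn.Parameter(torch.randn(num_classes, proto_dim))

    def forward(self, clips):
        # clips: (B, T, C, H, W) -> each short clip is encoded independently.
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.decoder(feats)          # (B, T, feat_dim), causal in T
        return self.cls_head(hidden), self.proto_head(hidden)


def joint_loss(model, clips, labels, proto_weight=1.0):
    """Loss over every partial observation (every prefix of the clip sequence)."""
    logits, proto_pred = model(clips)            # (B, T, K), (B, T, D)
    b, t, k = logits.shape
    # Classification loss at each time step, i.e. for each partial observation.
    ce = F.cross_entropy(logits.reshape(b * t, k), labels.repeat_interleave(t))
    # Regularizer: partial-observation embeddings should match the class prototype.
    target = model.prototypes[labels].unsqueeze(1).expand_as(proto_pred)
    return ce + proto_weight * F.mse_loss(proto_pred, target)


if __name__ == "__main__":
    model = EarlyActionModel(num_classes=10)
    clips = torch.randn(2, 8, 3, 16, 16)         # 2 videos, 8 short clips each
    labels = torch.tensor([3, 7])
    loss = joint_loss(model, clips, labels)
    loss.backward()
    print(float(loss))
```

The point of the sketch is that both losses are applied at every time step, so the representation of each partial observation is pulled toward the prototype of the full action while also being classified.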
Related papers
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z) - REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - An Empirical Study of End-to-End Temporal Action Detection [82.64373812690127]
Temporal action detection (TAD) is an important yet challenging task in video understanding.
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm.
We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement.
arXiv Detail & Related papers (2022-04-06T16:46:30Z) - Self-supervised Video Representation Learning with Cross-Stream
Prototypical Contrasting [2.2530496464901106]
"Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
arXiv Detail & Related papers (2021-06-18T13:57:51Z) - Memory-augmented Dense Predictive Coding for Video Representation
Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
In all cases, we demonstrate state-of-the-art or comparable performance to other approaches while using orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z) - Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate the framework's effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)