Self-Regulated Learning for Egocentric Video Activity Anticipation
- URL: http://arxiv.org/abs/2111.11631v1
- Date: Tue, 23 Nov 2021 03:29:18 GMT
- Title: Self-Regulated Learning for Egocentric Video Activity Anticipation
- Authors: Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian
- Abstract summary: Self-Regulated Learning (SRL) aims to consecutively regulate the intermediate representation to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets.
- Score: 147.9783215348252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Future activity anticipation is a challenging problem in egocentric vision.
As a standard future activity anticipation paradigm, recursive sequence
prediction suffers from the accumulation of errors. To address this problem, we
propose a simple and effective Self-Regulated Learning framework, which aims to
consecutively regulate the intermediate representation to produce a
representation that (a) emphasizes the novel information in the frame at the
current time-stamp in contrast to previously observed content, and (b) reflects
its correlation with previously observed frames. The former is achieved by
minimizing a contrastive loss, and the latter by a dynamic reweighting
mechanism that attends to informative frames in the observed content via a
similarity comparison between the features of the current frame and those of
the observed frames. The learned final video representation can be further
enhanced by multi-task learning, which performs joint feature learning on the
target activity labels and the automatically detected action and object class
tokens.
SRL sharply outperforms existing state-of-the-art methods in most cases on two
egocentric video datasets and two third-person video datasets. Its
effectiveness is further verified by the experimental finding that the action
and object concepts supporting the activity semantics can be accurately
identified.
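The two regulation steps described above can be made concrete with a short sketch. The following PyTorch fragment is a minimal, hypothetical rendering (not the authors' released code) of (a) an InfoNCE-style contrastive loss that treats the current frame as the positive and previously observed frames as negatives, and (b) the similarity-driven reweighting of observed frames; all class, method, and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SRLRegulatorSketch(nn.Module):
    """Illustrative sketch of the two regulation steps: a contrastive
    novelty loss and a similarity-based reweighting of observed frames."""

    def __init__(self, dim: int, temperature: float = 0.1):
        super().__init__()
        self.temperature = temperature
        self.proj = nn.Linear(dim, dim)  # projection head (an assumption)

    def contrastive_novelty_loss(self, h_t, f_t, f_past):
        # h_t: (B, D) intermediate representation at the current time-stamp
        # f_t: (B, D) current-frame feature, used as the positive
        # f_past: (B, T, D) previously observed frame features, the negatives
        q = F.normalize(self.proj(h_t), dim=-1)
        pos = F.normalize(f_t, dim=-1)
        neg = F.normalize(f_past, dim=-1)
        l_pos = (q * pos).sum(dim=-1, keepdim=True)   # (B, 1)
        l_neg = torch.einsum("bd,btd->bt", q, neg)    # (B, T)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature
        # The positive sits at index 0, so the target class is 0.
        target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, target)

    def reweight_observed(self, f_t, f_past):
        # Cosine similarity between the current frame and each observed
        # frame; a softmax over the scores attends to informative frames.
        sim = torch.einsum(
            "bd,btd->bt",
            F.normalize(f_t, dim=-1),
            F.normalize(f_past, dim=-1),
        )
        weights = F.softmax(sim / self.temperature, dim=-1)   # (B, T)
        context = torch.einsum("bt,btd->bd", weights, f_past)
        return context, weights
```

In the recursive prediction setting, the reweighted context would feed back into the next anticipation step, counteracting the error accumulation the abstract mentions; the two pieces are kept separate here for readability.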
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video [61.21388780334379]
This work focuses on the apparent emotional reaction recognition from the video-only input, conducted in a self-supervised fashion.
The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task.
arXiv Detail & Related papers (2022-10-20T15:21:51Z)
- Learning State-Aware Visual Representations from Audible Interactions [39.08554113807464]
We propose a self-supervised algorithm to learn representations from egocentric video data.
We use audio signals to identify moments of likely interactions which are conducive to better learning.
We validate these contributions extensively on two large-scale egocentric datasets.
arXiv Detail & Related papers (2022-09-27T17:57:13Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drift caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
- Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames.
This self-supervised framework is less complicated than other approaches, yet very effective in learning robust visual representations for both labeling and localization.
arXiv Detail & Related papers (2020-03-26T23:32:43Z)
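The feature-level prediction objective in the last entry can likewise be sketched. Below is a minimal, assumed PyTorch illustration (not the paper's actual LSTM stack or attention mechanisms) in which a small CNN encoder and a single LSTM predict each next frame's high-level feature, with the regression error serving as the self-supervised signal; all names and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictiveFeatureSketch(nn.Module):
    """Hypothetical sketch: a CNN encoder yields per-frame features, an
    LSTM summarizes the observed frames, and a linear head predicts the
    next frame's feature. The prediction error is the training signal."""

    def __init__(self, feat_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(   # stand-in for the CNN encoder
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, frames):
        # frames: (B, T, 3, H, W); predict the feature of frame t+1
        # from the features of frames 1..t.
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats[:, :-1])   # summaries of frames 1..T-1
        pred = self.head(hidden)               # predicted features for 2..T
        target = feats[:, 1:].detach()         # actual future features
        return F.mse_loss(pred, target)
```

A training step would simply backpropagate the returned loss over sampled clips; no labels are involved, which is what makes the signal self-supervised.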