REST: REtrieve & Self-Train for generative action recognition
- URL: http://arxiv.org/abs/2209.15000v1
- Date: Thu, 29 Sep 2022 17:57:01 GMT
- Title: REST: REtrieve & Self-Train for generative action recognition
- Authors: Adrian Bulat, Enrique Sanchez, Brais Martinez, and Georgios Tzimiropoulos
- Abstract summary: We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
- Score: 54.90704746573636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work is on training a generative action/video recognition model whose
output is a free-form action-specific caption describing the video (rather than
an action class label). A generative approach has practical advantages like
producing more fine-grained and human-readable output, and being naturally
open-world. To this end, we propose to adapt a pre-trained generative Vision &
Language (V&L) Foundation Model for video/action recognition. While recently
there have been a few attempts to adapt V&L models trained with contrastive
learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose
the very first method that sets out to accomplish this goal for a generative
model. We first show that direct fine-tuning of a generative model to produce
action classes suffers from severe overfitting. To alleviate this, we introduce
REST, a training framework consisting of two key components: (a) an unsupervised
method for adapting the generative model to action/video by means of
pseudo-caption generation and Self-training, i.e. without using any
action-specific labels; (b) a Retrieval approach based on CLIP for discovering
a diverse set of pseudo-captions for each video to train the model.
Importantly, we show that both components are necessary to obtain high
accuracy. We evaluate REST on the problem of zero-shot action recognition where
we show that our approach is very competitive when compared to contrastive
learning-based methods. Code will be made available.
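For concreteness, here is a minimal sketch of how the two components described above could fit together: CLIP-based retrieval selects a diverse set of pseudo-captions for each video, and the generative captioner is then self-trained on them. The helpers `clip_encode_image`, the pre-embedded `caption_bank` / `caption_embeddings`, and `gen_model.caption_loss` are hypothetical placeholders for illustration, not the authors' released code.

```python
# Hypothetical sketch of REST-style pseudo-caption retrieval and self-training.
# `clip_encode_image`, `caption_bank`, `caption_embeddings`, and
# `gen_model.caption_loss` are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def retrieve_pseudo_captions(video_frames, clip_encode_image,
                             caption_bank, caption_embeddings, k=5):
    """Return the k captions whose CLIP text embeddings best match the video."""
    frame_emb = clip_encode_image(video_frames)              # (T, D) per-frame embeddings
    video_emb = F.normalize(frame_emb.mean(dim=0), dim=-1)   # (D,) video-level embedding
    # caption_embeddings is assumed to be L2-normalized, so the dot product
    # below gives cosine similarities.
    sims = caption_embeddings @ video_emb                    # (N,)
    topk = sims.topk(k).indices
    return [caption_bank[i] for i in topk]

def self_train_step(gen_model, optimizer, video_frames, pseudo_captions):
    """One self-training update: fit the generative captioner to retrieved captions."""
    optimizer.zero_grad()
    # Average the captioning loss over the retrieved pseudo-captions;
    # no ground-truth action labels are used anywhere in this loop.
    loss = sum(gen_model.caption_loss(video_frames, c) for c in pseudo_captions)
    loss = loss / len(pseudo_captions)
    loss.backward()
    optimizer.step()
    return loss.item()
```

As the abstract emphasizes, both pieces matter: retrieval supplies diverse, video-specific pseudo-captions without any action-specific labels, and self-training is what adapts the generative model to the video domain.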
Related papers
- AICL: Action In-Context Learning for Video Diffusion Model [124.39948693332552]
We propose AICL, which empowers the generative model with the ability to understand action information in reference videos.
Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance.
arXiv Detail & Related papers (2024-03-18T07:41:19Z)
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Training and Evaluation of Deep Policies using Reinforcement Learning and Generative Models [67.78935378952146]
GenRL is a framework for solving sequential decision-making problems.
It exploits the combination of reinforcement learning and latent variable generative models.
We experimentally determine the characteristics of generative models that have the most influence on the performance of the final policy.
arXiv Detail & Related papers (2022-04-18T22:02:32Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Partner-Assisted Learning for Few-Shot Image Classification [54.66864961784989]
Few-shot Learning has been studied to mimic human visual capabilities and learn effective models without the need for exhaustive human annotation.
In this paper, we focus on the design of training strategy to obtain an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples.
We propose a two-stage training scheme, which first trains a partner encoder to model pair-wise similarities and extract features serving as soft-anchors, and then trains a main encoder by aligning its outputs with soft-anchors while attempting to maximize classification performance.
arXiv Detail & Related papers (2021-09-15T22:46:19Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.