Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition
- URL: http://arxiv.org/abs/2207.01297v4
- Date: Sun, 26 Mar 2023 16:28:26 GMT
- Title: Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition
- Authors: Wenhao Wu, Zhun Sun, Wanli Ouyang
- Abstract summary: Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
- Score: 102.93524173258487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring knowledge from task-agnostic pre-trained deep models for
downstream tasks is an important topic in computer vision research. Along with
the growth of computational capacity, we now have open-source vision-language
pre-trained models at large scale in both model architecture and amount of
data. In this study, we focus on transferring knowledge for video
classification tasks. Conventional methods randomly initialize the linear
classifier head for vision classification, but they leave the usage of the text
encoder for downstream visual recognition tasks unexplored. In this paper, we
revisit the role of the linear classifier and replace it with different
knowledge from the pre-trained model. We utilize the well-pretrained language
model to generate good semantic targets for efficient transfer learning. The
empirical study shows that our method improves both the
performance and the training speed of video classification, with a negligible
change in the model. Our simple yet effective tuning paradigm achieves
state-of-the-art performance and efficient training on various video
recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In
particular, our paradigm achieves the state-of-the-art accuracy of 87.8% on
Kinetics-400, and also surpasses previous methods by 20~50% absolute top-1
accuracy under zero-shot and few-shot settings on five popular video datasets.
Code and models can be found at https://github.com/whwu95/Text4Vis .
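As a concrete illustration of the tuning paradigm described in the abstract, below is a minimal sketch (not the official Text4Vis code; see the repository above) of replacing a randomly initialized linear classifier with frozen class-name embeddings from the CLIP text encoder. The prompt template, class names, and the assumption that video features come from temporally pooled CLIP frame features are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: use frozen text embeddings of the class names as the
# classifier weights instead of a randomly initialized linear head.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["archery", "playing guitar", "surfing water"]  # e.g. Kinetics labels
tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)

with torch.no_grad():
    classifier = model.encode_text(tokens).float()          # [num_classes, dim]
    classifier = classifier / classifier.norm(dim=-1, keepdim=True)

def classify(video_features: torch.Tensor) -> torch.Tensor:
    """video_features: [batch, dim], e.g. CLIP frame features averaged over time."""
    video_features = video_features / video_features.norm(dim=-1, keepdim=True)
    return video_features @ classifier.t()                   # cosine-similarity logits
```

Because the text-derived classifier can stay frozen during fine-tuning, only the visual branch needs gradient updates, which is consistent with the training-speed gains reported in the abstract.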
Related papers
- VoltaVision: A Transfer Learning model for electronic component classification [1.4132765964347058]
We introduce a lightweight CNN, coined as VoltaVision, and compare its performance against more complex models.
We test the hypothesis that transferring knowledge from a similar task to our target domain yields better results than state-of-the-art models trained on general datasets.
arXiv Detail & Related papers (2024-04-05T05:42:23Z)
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates features from all the clips in an online fashion for the final class prediction.
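As a rough sketch of the clip-then-aggregate pattern summarized above, the snippet below feeds per-clip features to an online (recurrent) aggregator before classification. The GRU is a stand-in for the paper's decoder, and the dimensions are arbitrary assumptions.

```python
# Illustrative sketch: per-clip features aggregated online, then classified.
# The GRU stands in for the paper's decoder; sizes are placeholders.
import torch
import torch.nn as nn

class OnlineAggregator(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 400):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: [batch, num_clips, feat_dim], one feature per short clip
        hidden, _ = self.rnn(clip_feats)   # each step sees only the clips so far
        return self.head(hidden[:, -1])    # predict from the latest state

logits = OnlineAggregator()(torch.randn(2, 8, 512))  # e.g. 8 clips per video
```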
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- LiT Tuned Models for Efficient Species Detection [22.3395465641384]
Our paper introduces a simple methodology for adapting any fine-grained image classification dataset for distributed vision-language pretraining.
We implement this methodology on the challenging iNaturalist-2021 dataset, comprised of approximately 2.7 million images of macro-organisms across 10,000 classes.
Our model (trained using a new method called locked-image text tuning) uses a pre-trained, frozen vision representation, proving that language alignment alone can attain strong transfer learning performance.
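A rough sketch of the locked-image text tuning setup mentioned above, assuming a CLIP-style contrastive objective: the image tower is frozen and only the text tower is trained. The encoders, optimizer, and temperature are placeholders rather than the paper's exact configuration.

```python
# Sketch of locked-image text tuning: freeze the image encoder and train only
# the text encoder with a symmetric contrastive loss. Encoders are placeholders.
import torch
import torch.nn.functional as F

def lit_step(image_encoder, text_encoder, images, token_ids, optimizer, temperature=0.07):
    with torch.no_grad():                          # "locked" image tower
        img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(token_ids), dim=-1)

    logits = img @ txt.t() / temperature           # [batch, batch] similarities
    targets = torch.arange(len(images), device=logits.device)
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

    optimizer.zero_grad()
    loss.backward()                                # gradients reach only the text tower
    optimizer.step()
    return loss.item()
```

Only the text encoder's parameters would be registered with the optimizer, which is what keeps the pre-trained vision representation intact.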
arXiv Detail & Related papers (2023-02-12T20:36:55Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
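One plausible reading of the parameter-free temporal saliency idea is to weight frames by their similarity to the category's text embedding, with the weights obtained from a softmax over time. The sketch below is an interpretation of the summary, not the released BIKE code; the temperature value is an arbitrary assumption.

```python
# Rough sketch: parameter-free temporal saliency from frame-text similarity.
import torch

def saliency_pool(frame_feats: torch.Tensor, text_feat: torch.Tensor, tau: float = 0.01):
    # frame_feats: [T, dim] per-frame features; text_feat: [dim] category embedding
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm()
    saliency = (frame_feats @ text_feat) / tau        # per-frame similarity scores
    weights = saliency.softmax(dim=0)                 # temporal weights, no learned params
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)  # saliency-weighted feature
```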
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
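The token-extraction step described above can be sketched as embedding non-overlapping spatio-temporal tubelets with a 3D convolution and passing them to a standard transformer encoder. The sizes below are illustrative, and positional embeddings and ViViT's factorized attention variants are omitted for brevity.

```python
# Sketch: tubelet tokens via 3D convolution, encoded by a transformer.
# Dimensions are illustrative; positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class TubeletTransformer(nn.Module):
    def __init__(self, dim=256, tubelet=(2, 16, 16), depth=4, heads=8):
        super().__init__()
        self.to_tokens = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: [batch, 3, frames, height, width]
        tokens = self.to_tokens(video).flatten(2).transpose(1, 2)  # [batch, N, dim]
        return self.encoder(tokens).mean(dim=1)                    # pooled video feature

feat = TubeletTransformer()(torch.randn(1, 3, 8, 64, 64))
```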
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.