FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
- URL: http://arxiv.org/abs/2402.03241v1
- Date: Mon, 5 Feb 2024 17:56:41 GMT
- Title: FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
- Authors: Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
- Abstract summary: We introduce FROSTER, an effective framework for open-vocabulary action recognition.
Applying CLIP directly to the action recognition task is challenging due to the absence of temporal information in CLIP's pretraining.
We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings.
- Score: 30.15770881713811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce FROSTER, an effective framework for
open-vocabulary action recognition. The CLIP model has achieved remarkable
success in a range of image-based tasks, benefiting from its strong
generalization capability stemming from pretraining on massive image-text pairs.
However, applying CLIP directly to the open-vocabulary action recognition task
is challenging due to the absence of temporal information in CLIP's
pretraining. Further, fine-tuning CLIP on action recognition datasets may lead
to overfitting and hinder its generalizability, resulting in unsatisfactory
results when dealing with unseen actions.
To address these issues, FROSTER employs a residual feature distillation
approach to ensure that CLIP retains its generalization capability while
effectively adapting to the action recognition task. Specifically, the residual
feature distillation treats the frozen CLIP model as a teacher to maintain the
generalizability exhibited by the original CLIP and supervises the feature
learning for the extraction of video-specific features to bridge the gap
between images and videos. Meanwhile, it uses a residual sub-network for
feature distillation to reach a balance between the two distinct objectives of
learning generalizable and video-specific features.
We extensively evaluate FROSTER on open-vocabulary action recognition
benchmarks under both base-to-novel and cross-dataset settings. FROSTER
consistently achieves state-of-the-art performance on all datasets across the
board. Project page: https://visual-ai.github.io/froster.
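As a rough illustration of the residual feature distillation described above, here is a minimal PyTorch-style sketch; the module names, sizes, and the exact loss are assumptions for illustration, not the paper's released code. The frozen CLIP supplies teacher features, the tuned video encoder supplies student features, and a small residual sub-network projects the student features before a cosine-based matching loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualDistillationHead(nn.Module):
    """Hypothetical residual sub-network: a small two-layer projector with a
    skip connection, applied to the tuned (student) features before they are
    matched against the frozen CLIP teacher."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # The skip connection keeps the video-specific signal intact, while the
        # projection nudges the feature back toward the teacher's space.
        return student_feat + self.proj(student_feat)


def residual_distillation_loss(student_feat, teacher_feat, head):
    # The frozen CLIP acts as the teacher: no gradients flow through it.
    t = F.normalize(teacher_feat.detach(), dim=-1)
    s = F.normalize(head(student_feat), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()  # cosine feature matching
```

In training, a term like this would sit alongside the usual video-text classification objective, so the tuned model stays close to the frozen CLIP while still learning video-specific cues.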
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer-grained representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
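A rough sketch of the RCS idea under stated assumptions (tensor shapes and the simple averaging scheme are hypothetical, not ResCLIP's actual implementation): attention maps computed from intermediate-layer queries and keys are aggregated and used to re-weight the final block's values.

```python
import torch


def remold_final_attention(inter_qs, inter_ks, v_final, scale):
    """Illustrative only: aggregate cross-correlation attention maps from
    intermediate layers and use the result in place of the final block's
    self-attention weights.

    inter_qs, inter_ks: lists of [B, N, D] query/key tensors from intermediate layers.
    v_final:            [B, N, D] value tensor from the final block.
    """
    maps = []
    for q, k in zip(inter_qs, inter_ks):
        # q-k cross-correlation from a non-final layer carries localization cues.
        maps.append(torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1))
    remolded = torch.stack(maps, dim=0).mean(dim=0)  # [B, N, N] averaged attention
    return remolded @ v_final                        # re-weighted final-block features
```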
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- CLFace: A Scalable and Resource-Efficient Continual Learning Framework for Lifelong Face Recognition [0.0]
CLFace is a continual learning framework designed to preserve and incrementally extend the learned knowledge.
It eliminates the classification layer, resulting in a resource-efficient FR model that remains fixed throughout lifelong learning.
It incorporates a geometry-preserving distillation scheme to maintain the orientation of the teacher model's feature embedding.
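A minimal sketch of what a geometry-preserving (orientation-matching) distillation term could look like, assuming normalized embeddings and a frozen teacher; the paper's exact formulation may differ.

```python
import torch.nn.functional as F


def orientation_distillation_loss(student_emb, teacher_emb):
    # Hypothetical geometry-preserving term: keep each embedding pointing in
    # the same direction as the frozen teacher's, so the angular structure used
    # for face matching survives incremental updates.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```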
arXiv Detail & Related papers (2024-11-21T06:55:43Z)
- FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [7.041364616661048]
Foveal-Attention CLIP (FALIP) adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module.
FALIP effectively boosts CLIP zero-shot performance in tasks such as referring expressions comprehension, image classification, and 3D point cloud recognition.
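A minimal sketch of the foveal-attention idea, assuming an additive bias on the self-attention logits; shapes and the mask construction below are hypothetical, not FALIP's actual code.

```python
import torch


def foveal_masked_attention(q, k, v, foveal_mask, scale):
    """Illustrative: bias the self-attention logits with a 'foveal' mask so
    tokens inside the prompted region draw more attention, with no fine-tuning
    of CLIP's weights.

    q, k, v:     [B, heads, N, D]
    foveal_mask: [B, 1, N, N] additive bias (e.g. 0 inside the region, negative outside)
    """
    logits = (q @ k.transpose(-2, -1)) * scale + foveal_mask
    attn = torch.softmax(logits, dim=-1)
    return attn @ v
```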
arXiv Detail & Related papers (2024-07-08T03:23:13Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a large number of high-quality, diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
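A small sketch of the neighbor-representation idea under assumed shapes; the function name and the top-k sparsification are illustrative, not the paper's code. The image is re-described by its similarities to its nearest texts in a large pool.

```python
import torch
import torch.nn.functional as F


def neighbor_representation(image_feat, text_pool_feats, k=50):
    """Illustrative: describe an image by its cosine similarities to its k
    nearest texts from a large, diverse text pool (the 'distance structure').

    image_feat:      [D]    CLIP image embedding
    text_pool_feats: [M, D] pre-encoded text embeddings
    """
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(text_pool_feats, dim=-1)
    sims = txt @ img                      # [M] cosine similarities
    top = sims.topk(k)
    coder = torch.zeros_like(sims).scatter(0, top.indices, top.values)
    return coder                          # sparse similarity vector as the representation
```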
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations [9.444540281544715]
We introduce a novel agent for active open-vocabulary recognition.
The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge.
arXiv Detail & Related papers (2023-11-28T19:24:07Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
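A simplified, hypothetical stand-in for prototype modulation; CLIP-FSAR fuses text and support features with a learned module, so the fixed blend below is only meant to show the direction of the idea.

```python
import torch.nn.functional as F


def modulate_prototypes(visual_protos, text_feats, alpha=0.5):
    # Blend each class's few-shot visual prototype with its CLIP text feature
    # so the prototype inherits CLIP's semantic prior. alpha is an assumed knob,
    # not a parameter from the paper.
    v = F.normalize(visual_protos, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return F.normalize(alpha * v + (1.0 - alpha) * t, dim=-1)
```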
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework.
By jointly training these two tasks in a mutually beneficial manner, our model learns higher quality modality-shared and ID-related features.
Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
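In code, this kind of joint training typically reduces to a weighted sum of the two task losses over a shared backbone; a minimal sketch with an assumed weighting, not the paper's exact objective:

```python
def joint_loss(reid_loss, pose_loss, weight=0.1):
    # Hypothetical joint objective: the auxiliary pose-estimation loss is added
    # to the VI-ReID loss so both tasks shape the shared backbone features.
    return reid_loss + weight * pose_loss
```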
arXiv Detail & Related papers (2022-01-11T09:44:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.