Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
- URL: http://arxiv.org/abs/2504.01890v2
- Date: Mon, 07 Apr 2025 08:59:15 GMT
- Title: Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
- Authors: Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin,
- Abstract summary: We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data.
- Score: 11.47868206641396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding has shown remarkable improvements in recent years, largely driven by the availability of large-scale labeled datasets. Recent advances in visual-language models, especially those based on contrastive pretraining, have shown remarkable generalization on zero-shot tasks, helping to overcome this dependence on labeled datasets. Adapting such models to video typically involves modifying the architecture of the vision-language model to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. This preserves its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and less computation. In particular, we use just 1/3 the GFLOPs and 1/28 the number of tunable parameters of the recent state-of-the-art and still outperform it by up to 15.8%, depending on the task and dataset.
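To make the general recipe concrete, below is a minimal PyTorch sketch of what temporal visual prompting on a frozen backbone can look like: a small set of learnable prompt tokens is prepended to each frame's patch tokens while the encoder itself stays frozen. The stub encoder, prompt shapes, and average pooling are illustrative assumptions, not the TP-CLIP implementation.

```python
import torch
import torch.nn as nn

class FrozenVisionEncoderStub(nn.Module):
    """Placeholder for a frozen CLIP-style vision encoder (an assumption, not the paper's code)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, tokens):            # tokens: (B, N, D)
        return self.proj(tokens).mean(1)  # (B, D) pooled frame embedding

class TemporalPromptVideoEncoder(nn.Module):
    """Sketch of temporal visual prompting: only the prompts are trainable."""
    def __init__(self, num_frames=8, num_prompts=4, dim=512):
        super().__init__()
        self.backbone = FrozenVisionEncoderStub(dim)
        # one small set of learnable prompt tokens per frame position
        self.temporal_prompts = nn.Parameter(
            torch.randn(num_frames, num_prompts, dim) * 0.02)

    def forward(self, frame_tokens):      # (B, T, N, D) patch tokens per frame
        b, t, n, d = frame_tokens.shape
        prompts = self.temporal_prompts.unsqueeze(0).expand(b, -1, -1, -1)
        x = torch.cat([prompts, frame_tokens], dim=2)        # prepend prompts per frame
        x = x.reshape(b * t, -1, d)
        frame_emb = self.backbone(x).reshape(b, t, d)
        return frame_emb.mean(dim=1)                          # simple temporal average

video = TemporalPromptVideoEncoder()
feats = video(torch.randn(2, 8, 49, 512))                     # -> (2, 512)
trainable = [n for n, p in video.named_parameters() if p.requires_grad]
print(feats.shape, trainable)  # only 'temporal_prompts' carries gradients
```

Because only the prompt tokens are trainable, the tunable parameter count stays tiny relative to the backbone, which is the property the abstract emphasizes.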
Related papers
- Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder [0.6437284704257459]
We present a new Factorized Graph Sequence network that runs in real-time and scales effectively in the temporal dimension.
We also introduce a Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings.
Our model outperforms the previous state-of-the-art real-time approach, achieving 14.3% and 5.6% improvements in F1-macro score.
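The abstract does not define Hand Pooling precisely; one plausible reading, sketched below purely as an assumption, is a masked mean pooling that forms the graph-level embedding from hand-related nodes only.

```python
import torch

def hand_pooling(node_embeds: torch.Tensor, hand_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical hand-focused pooling (an assumption, not the paper's definition):
    average only the node embeddings that belong to hand joints/objects.
    node_embeds: (B, N, D) per-node features; hand_mask: (B, N) boolean."""
    mask = hand_mask.float().unsqueeze(-1)                 # (B, N, 1)
    summed = (node_embeds * mask).sum(dim=1)               # (B, D)
    count = mask.sum(dim=1).clamp(min=1.0)                 # avoid divide-by-zero
    return summed / count

graph_embed = hand_pooling(torch.randn(2, 30, 128),
                           torch.rand(2, 30) > 0.7)        # -> (2, 128)
```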
arXiv Detail & Related papers (2025-03-15T07:58:25Z)
- CLIP's Visual Embedding Projector is a Few-shot Cornucopia [45.93202559299953]
We introduce an alternative way for few-shot CLIP adaptation without adding "external" parameters to optimize.
We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines.
This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset encoder transfer, domain generalization, and base-to-new class generalization.
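The recipe is simple enough to sketch directly: freeze everything except the visual projection matrix and run a standard few-shot objective. The `CLIPLikeVisionTower` stand-in, its attribute names, and the toy training step below are assumptions for illustration, not the released ProLIP code.

```python
import torch
import torch.nn as nn

class CLIPLikeVisionTower(nn.Module):
    """Stand-in for a CLIP vision encoder; `visual_projection` mirrors the
    matrix that maps encoder features into the shared embedding space."""
    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        self.visual_projection = nn.Linear(feat_dim, embed_dim, bias=False)

    def forward(self, feats):
        return self.visual_projection(self.encoder(feats))

model = CLIPLikeVisionTower()

# Freeze everything, then unfreeze only the projection matrix.
for p in model.parameters():
    p.requires_grad = False
for p in model.visual_projection.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Dummy few-shot step: cosine-similarity logits against frozen class text embeddings.
images = torch.randn(8, 768)          # pre-extracted image features (assumption)
text_embeds = torch.randn(10, 512)    # frozen class text embeddings (assumption)
labels = torch.randint(0, 10, (8,))

img_embeds = nn.functional.normalize(model(images), dim=-1)
logits = img_embeds @ nn.functional.normalize(text_embeds, dim=-1).t()
loss = nn.functional.cross_entropy(logits / 0.07, labels)
loss.backward()
optimizer.step()
```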
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation, which unifies semantic and instance segmentation, is a challenging computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
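A minimal sketch of this general pattern in a continual setting is given below, assuming each incoming task contributes a fresh block of prompt embeddings while the backbone and earlier prompts stay frozen; it is illustrative of the idea, not the ECLIPSE implementation.

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """Sketch of continual visual prompt tuning: each new task contributes a
    fresh block of prompt embeddings; old prompts (and the frozen backbone,
    not shown) are left untouched. Illustrative assumption only."""
    def __init__(self, dim=256, prompts_per_task=10):
        super().__init__()
        self.dim, self.k = dim, prompts_per_task
        self.prompts = nn.ParameterList()

    def add_task(self):
        # Freeze prompts learned for earlier tasks.
        for p in self.prompts:
            p.requires_grad = False
        self.prompts.append(nn.Parameter(torch.randn(self.k, self.dim) * 0.02))

    def forward(self, tokens):            # tokens: (B, N, D) backbone tokens
        all_prompts = torch.cat(list(self.prompts), dim=0)          # (T*k, D)
        all_prompts = all_prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([all_prompts, tokens], dim=1)

pool = PromptPool()
pool.add_task()                            # task 1
pool.add_task()                            # task 2: only these prompts train
x = pool(torch.randn(4, 100, 256))         # -> (4, 120, 256)
print(x.shape, [p.requires_grad for p in pool.prompts])  # [False, True]
```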
arXiv Detail & Related papers (2024-03-29T11:31:12Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
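Below is a minimal sketch of what a Mixture-of-Experts adapter on top of frozen CLIP features can look like: a router softly combines small bottleneck experts and adds a residual correction. The dimensions, router design, and residual formulation are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Minimal Mixture-of-Experts adapter over frozen features: a router
    softly weights small bottleneck experts and the result is added back
    as a residual correction. Illustrative assumption only."""
    def __init__(self, dim=512, bottleneck=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(num_experts))

    def forward(self, x):                          # x: (B, D) frozen CLIP features
        weights = self.router(x).softmax(dim=-1)   # (B, E) soft expert weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return x + (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # residual

adapter = MoEAdapter()
adapted = adapter(torch.randn(8, 512))             # -> (8, 512)
```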
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- EZ-CLIP: Efficient Zeroshot Video Action Recognition [13.403597169664803]
We present EZ-CLIP, a simple and efficient adaptation of CLIP.
We introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion.
EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.
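The abstract does not spell out the motion-focused objective; the snippet below is one hypothetical form of such a regularizer, included only as an assumption: it penalizes frame embeddings whose temporal variance is too low, nudging the prompts toward motion rather than static appearance.

```python
import torch

def motion_variance_loss(frame_embeds: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Hypothetical motion-focused regularizer (an assumption, not EZ-CLIP's
    published loss): penalize low variance across the temporal axis so that
    prompt-adapted frame embeddings reflect motion rather than repeating a
    single static appearance. frame_embeds: (B, T, D) per-frame embeddings."""
    frame_embeds = torch.nn.functional.normalize(frame_embeds, dim=-1)
    temporal_std = frame_embeds.std(dim=1)             # (B, D) spread over frames
    return torch.relu(margin - temporal_std).mean()    # hinge: want std above margin

loss = motion_variance_loss(torch.randn(2, 8, 512))
```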
arXiv Detail & Related papers (2023-12-13T09:33:08Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
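The fusion idea is easy to sketch: a single transformer encoder layer attends over the query embedding together with k retrieved cross-modal embeddings and returns a refined query. The module below is a generic illustration under that assumption, not the RECO implementation.

```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Light-weight fusion on top of frozen embeddings: a single transformer
    encoder layer mixes the query embedding with k retrieved embeddings from
    a memory and returns the refined query token. Illustrative assumption."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.fuse = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query_embed, retrieved_embeds):
        # query_embed: (B, D); retrieved_embeds: (B, K, D) from a memory bank
        tokens = torch.cat([query_embed.unsqueeze(1), retrieved_embeds], dim=1)
        fused = self.fuse(tokens)
        return nn.functional.normalize(fused[:, 0], dim=-1)  # refined query

fusion = RetrievalFusion()
refined = fusion(torch.randn(4, 512), torch.randn(4, 5, 512))  # -> (4, 512)
```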
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning.
We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continued training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Localized Latent Updates for Fine-Tuning Vision-Language Models [15.285292154680246]
In this work, we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints.
We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results both on classes seen and unseen during training are comparable with or improve on the state of the art.
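One way to realize "updates only close to seen datapoints" is sketched below, purely as an assumption based on the abstract: the adapter's correction to the zero-shot logits is gated by similarity to cached few-shot embeddings, so inputs far from the training data fall back to the zero-shot prediction.

```python
import torch
import torch.nn as nn

class LocalizedAdapter(nn.Module):
    """Hypothetical localized adapter (an assumption, not the paper's exact
    formulation): the learned correction to the zero-shot logits is gated by
    similarity to cached few-shot embeddings, preserving zero-shot behavior
    far away from seen datapoints."""
    def __init__(self, dim=512, num_classes=10, tau=10.0):
        super().__init__()
        self.correction = nn.Linear(dim, num_classes)
        self.tau = tau
        self.register_buffer("seen_embeds", torch.empty(0, dim))

    def cache(self, train_embeds):                  # (M, D) few-shot embeddings
        self.seen_embeds = nn.functional.normalize(train_embeds, dim=-1)

    def forward(self, embeds, zero_shot_logits):    # (B, D), (B, C)
        embeds_n = nn.functional.normalize(embeds, dim=-1)
        sims = embeds_n @ self.seen_embeds.t()      # (B, M) similarity to seen data
        gate = torch.sigmoid(self.tau * (sims.max(dim=-1).values - 0.5))  # (B,)
        return zero_shot_logits + gate.unsqueeze(-1) * self.correction(embeds)

adapter = LocalizedAdapter()
adapter.cache(torch.randn(40, 512))
logits = adapter(torch.randn(8, 512), torch.randn(8, 10))   # -> (8, 10)
```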
arXiv Detail & Related papers (2022-12-13T13:15:20Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.