Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
- URL: http://arxiv.org/abs/2504.01890v2
- Date: Mon, 07 Apr 2025 08:59:15 GMT
- Title: Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
- Authors: Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin,
- Abstract summary: We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data.
- Score: 11.47868206641396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding has shown remarkable improvements in recent years, largely driven by the availability of large-scale labeled datasets. Recent advances in visual-language models, especially those based on contrastive pretraining, have shown remarkable generalization on zero-shot tasks, helping to overcome this dependence on labeled datasets. Adapting such models to video typically involves modifying the architecture of the vision-language model to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. This preserves its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and less computation. In particular, we use just 1/3 the GFLOPs and 1/28 the number of tunable parameters of the recent state-of-the-art and still outperform it by up to 15.8%, depending on the task and dataset.
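To make the general recipe concrete, below is a minimal PyTorch sketch of what temporal visual prompting on a frozen backbone can look like: a small set of learnable prompt tokens is prepended to each frame's patch tokens while the encoder itself stays frozen. The stub encoder, prompt shapes, and average pooling are illustrative assumptions, not the TP-CLIP implementation.

```python
import torch
import torch.nn as nn

class FrozenVisionEncoderStub(nn.Module):
    """Placeholder for a frozen CLIP-style vision encoder (an assumption, not the paper's code)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, tokens):            # tokens: (B, N, D)
        return self.proj(tokens).mean(1)  # (B, D) pooled frame embedding

class TemporalPromptVideoEncoder(nn.Module):
    """Sketch of temporal visual prompting: only the prompts are trainable."""
    def __init__(self, num_frames=8, num_prompts=4, dim=512):
        super().__init__()
        self.backbone = FrozenVisionEncoderStub(dim)
        # one small set of learnable prompt tokens per frame position
        self.temporal_prompts = nn.Parameter(
            torch.randn(num_frames, num_prompts, dim) * 0.02)

    def forward(self, frame_tokens):      # (B, T, N, D) patch tokens per frame
        b, t, n, d = frame_tokens.shape
        prompts = self.temporal_prompts.unsqueeze(0).expand(b, -1, -1, -1)
        x = torch.cat([prompts, frame_tokens], dim=2)        # prepend prompts per frame
        x = x.reshape(b * t, -1, d)
        frame_emb = self.backbone(x).reshape(b, t, d)
        return frame_emb.mean(dim=1)                          # simple temporal average

video = TemporalPromptVideoEncoder()
feats = video(torch.randn(2, 8, 49, 512))                     # -> (2, 512)
trainable = [n for n, p in video.named_parameters() if p.requires_grad]
print(feats.shape, trainable)  # only 'temporal_prompts' carries gradients
```

Because only the prompt tokens are trainable, the tunable parameter count stays tiny relative to the backbone, which is the property the abstract emphasizes.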
Related papers
- Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder [0.6437284704257459]
We present a new Factorized Graph Sequence network that runs in real-time and scales effectively in the temporal dimension.
We also introduce a Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings.
Our model outperforms the previous state-of-the-art real-time approach, achieving 14.3% and 5.6% improvements in F1-macro score.
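The abstract does not define Hand Pooling precisely; one plausible reading, sketched below purely as an assumption, is a masked mean pooling that forms the graph-level embedding from hand-related nodes only.

```python
import torch

def hand_pooling(node_embeds: torch.Tensor, hand_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical hand-focused pooling (an assumption, not the paper's definition):
    average only the node embeddings that belong to hand joints/objects.
    node_embeds: (B, N, D) per-node features; hand_mask: (B, N) boolean."""
    mask = hand_mask.float().unsqueeze(-1)                 # (B, N, 1)
    summed = (node_embeds * mask).sum(dim=1)               # (B, D)
    count = mask.sum(dim=1).clamp(min=1.0)                 # avoid divide-by-zero
    return summed / count

graph_embed = hand_pooling(torch.randn(2, 30, 128),
                           torch.rand(2, 30) > 0.7)        # -> (2, 128)
```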
arXiv Detail & Related papers (2025-03-15T07:58:25Z)
- CLIP's Visual Embedding Projector is a Few-shot Cornucopia [45.93202559299953]
We introduce an alternative way for few-shot CLIP adaptation without adding "external" parameters to optimize.
We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines.
This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset encoder transfer, domain generalization, and base-to-new class generalization.
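The recipe is simple enough to sketch directly: freeze everything except the visual projection matrix and run a standard few-shot objective. The `CLIPLikeVisionTower` stand-in, its attribute names, and the toy training step below are assumptions for illustration, not the released ProLIP code.

```python
import torch
import torch.nn as nn

class CLIPLikeVisionTower(nn.Module):
    """Stand-in for a CLIP vision encoder; `visual_projection` mirrors the
    matrix that maps encoder features into the shared embedding space."""
    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        self.visual_projection = nn.Linear(feat_dim, embed_dim, bias=False)

    def forward(self, feats):
        return self.visual_projection(self.encoder(feats))

model = CLIPLikeVisionTower()

# Freeze everything, then unfreeze only the projection matrix.
for p in model.parameters():
    p.requires_grad = False
for p in model.visual_projection.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Dummy few-shot step: cosine-similarity logits against frozen class text embeddings.
images = torch.randn(8, 768)          # pre-extracted image features (assumption)
text_embeds = torch.randn(10, 512)    # frozen class text embeddings (assumption)
labels = torch.randint(0, 10, (8,))

img_embeds = nn.functional.normalize(model(images), dim=-1)
logits = img_embeds @ nn.functional.normalize(text_embeds, dim=-1).t()
loss = nn.functional.cross_entropy(logits / 0.07, labels)
loss.backward()
optimizer.step()
```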
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation, which unifies semantic and instance segmentation, is a challenging computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
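A minimal sketch of this general pattern in a continual setting is given below, assuming each incoming task contributes a fresh block of prompt embeddings while the backbone and earlier prompts stay frozen; it is illustrative of the idea, not the ECLIPSE implementation.

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """Sketch of continual visual prompt tuning: each new task contributes a
    fresh block of prompt embeddings; old prompts (and the frozen backbone,
    not shown) are left untouched. Illustrative assumption only."""
    def __init__(self, dim=256, prompts_per_task=10):
        super().__init__()
        self.dim, self.k = dim, prompts_per_task
        self.prompts = nn.ParameterList()

    def add_task(self):
        # Freeze prompts learned for earlier tasks.
        for p in self.prompts:
            p.requires_grad = False
        self.prompts.append(nn.Parameter(torch.randn(self.k, self.dim) * 0.02))

    def forward(self, tokens):            # tokens: (B, N, D) backbone tokens
        all_prompts = torch.cat(list(self.prompts), dim=0)          # (T*k, D)
        all_prompts = all_prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([all_prompts, tokens], dim=1)

pool = PromptPool()
pool.add_task()                            # task 1
pool.add_task()                            # task 2: only these prompts train
x = pool(torch.randn(4, 100, 256))         # -> (4, 120, 256)
print(x.shape, [p.requires_grad for p in pool.prompts])  # [False, True]
```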
arXiv Detail & Related papers (2024-03-29T11:31:12Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
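Below is a minimal sketch of what a Mixture-of-Experts adapter on top of frozen CLIP features can look like: a router softly combines small bottleneck experts and adds a residual correction. The dimensions, router design, and residual formulation are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Minimal Mixture-of-Experts adapter over frozen features: a router
    softly weights small bottleneck experts and the result is added back
    as a residual correction. Illustrative assumption only."""
    def __init__(self, dim=512, bottleneck=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(num_experts))

    def forward(self, x):                          # x: (B, D) frozen CLIP features
        weights = self.router(x).softmax(dim=-1)   # (B, E) soft expert weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return x + (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # residual

adapter = MoEAdapter()
adapted = adapter(torch.randn(8, 512))             # -> (8, 512)
```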
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- EZ-CLIP: Efficient Zeroshot Video Action Recognition [13.403597169664803]
We present EZ-CLIP, a simple and efficient adaptation of CLIP.
We introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion.
EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.
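The abstract does not spell out the motion-focused objective; the snippet below is one hypothetical form of such a regularizer, included only as an assumption: it penalizes frame embeddings whose temporal variance is too low, nudging the prompts toward motion rather than static appearance.

```python
import torch

def motion_variance_loss(frame_embeds: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Hypothetical motion-focused regularizer (an assumption, not EZ-CLIP's
    published loss): penalize low variance across the temporal axis so that
    prompt-adapted frame embeddings reflect motion rather than repeating a
    single static appearance. frame_embeds: (B, T, D) per-frame embeddings."""
    frame_embeds = torch.nn.functional.normalize(frame_embeds, dim=-1)
    temporal_std = frame_embeds.std(dim=1)             # (B, D) spread over frames
    return torch.relu(margin - temporal_std).mean()    # hinge: want std above margin

loss = motion_variance_loss(torch.randn(2, 8, 512))
```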
arXiv Detail & Related papers (2023-12-13T09:33:08Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
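The fusion idea is easy to sketch: a single transformer encoder layer attends over the query embedding together with k retrieved cross-modal embeddings and returns a refined query. The module below is a generic illustration under that assumption, not the RECO implementation.

```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Light-weight fusion on top of frozen embeddings: a single transformer
    encoder layer mixes the query embedding with k retrieved embeddings from
    a memory and returns the refined query token. Illustrative assumption."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.fuse = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query_embed, retrieved_embeds):
        # query_embed: (B, D); retrieved_embeds: (B, K, D) from a memory bank
        tokens = torch.cat([query_embed.unsqueeze(1), retrieved_embeds], dim=1)
        fused = self.fuse(tokens)
        return nn.functional.normalize(fused[:, 0], dim=-1)  # refined query

fusion = RetrievalFusion()
refined = fusion(torch.randn(4, 512), torch.randn(4, 5, 512))  # -> (4, 512)
```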
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning.
We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continued training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Localized Latent Updates for Fine-Tuning Vision-Language Models [15.285292154680246]
In this work, we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints.
We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results both on classes seen and unseen during training are comparable with or improve on the state of the art.
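One way to realize "updates only close to seen datapoints" is sketched below, purely as an assumption based on the abstract: the adapter's correction to the zero-shot logits is gated by similarity to cached few-shot embeddings, so inputs far from the training data fall back to the zero-shot prediction.

```python
import torch
import torch.nn as nn

class LocalizedAdapter(nn.Module):
    """Hypothetical localized adapter (an assumption, not the paper's exact
    formulation): the learned correction to the zero-shot logits is gated by
    similarity to cached few-shot embeddings, preserving zero-shot behavior
    far away from seen datapoints."""
    def __init__(self, dim=512, num_classes=10, tau=10.0):
        super().__init__()
        self.correction = nn.Linear(dim, num_classes)
        self.tau = tau
        self.register_buffer("seen_embeds", torch.empty(0, dim))

    def cache(self, train_embeds):                  # (M, D) few-shot embeddings
        self.seen_embeds = nn.functional.normalize(train_embeds, dim=-1)

    def forward(self, embeds, zero_shot_logits):    # (B, D), (B, C)
        embeds_n = nn.functional.normalize(embeds, dim=-1)
        sims = embeds_n @ self.seen_embeds.t()      # (B, M) similarity to seen data
        gate = torch.sigmoid(self.tau * (sims.max(dim=-1).values - 0.5))  # (B,)
        return zero_shot_logits + gate.unsqueeze(-1) * self.correction(embeds)

adapter = LocalizedAdapter()
adapter.cache(torch.randn(40, 512))
logits = adapter(torch.randn(8, 512), torch.randn(8, 10))   # -> (8, 10)
```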
arXiv Detail & Related papers (2022-12-13T13:15:20Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.