Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2505.06002v2
- Date: Thu, 03 Jul 2025 03:25:01 GMT
- Title: Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition
- Authors: Congqi Cao, Peiheng Han, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang
- Abstract summary: We propose a parameter-efficient dual adaptation method for both image and text encoders. Specifically, we design a task-specific adaptation for the image encoder so that the most discriminative information is emphasized during feature extraction. We also develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as the semantic descriptions.
- Score: 33.22316608406554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, to few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) the exploration of task-specific information in the visual modality is insufficient; 3) semantic order information is typically overlooked during text modeling; and 4) existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information is emphasized during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as the semantic descriptions. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, which consistently achieves state-of-the-art performance on five benchmarks. The code is open-sourced at https://github.com/Jaulin-Bage/Task-Adapter-pp.
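Since the abstract packs several mechanisms into one paragraph, a toy sketch of the order-aware alignment step may help fix ideas: frames and the LLM-generated sub-action descriptions are divided into the same number of temporal stages and matched stage by stage. Everything below (the equal-length chunking, mean pooling, and dimensions) is an illustrative assumption rather than the authors' implementation; see the linked repository for the real code.

```python
# Illustrative sketch of order-aware cross-modal alignment (assumptions
# throughout; not the authors' implementation).
import torch
import torch.nn.functional as F

def stage_aligned_score(frame_feats: torch.Tensor,
                        subaction_feats: torch.Tensor,
                        num_stages: int = 4) -> torch.Tensor:
    """frame_feats: (T, D) visual features for one video.
    subaction_feats: (K, D) text features for K ordered sub-actions.
    Match frames to sub-actions that fall in the same temporal stage."""
    frame_stages = torch.chunk(frame_feats, num_stages, dim=0)
    text_stages = torch.chunk(subaction_feats, num_stages, dim=0)
    scores = []
    for f, s in zip(frame_stages, text_stages):
        f = F.normalize(f.mean(dim=0), dim=-1)  # pool frames in this stage
        s = F.normalize(s.mean(dim=0), dim=-1)  # pool sub-actions in this stage
        scores.append(f @ s)                    # per-stage cosine similarity
    return torch.stack(scores).mean()

# toy usage: 8 frames and 8 ordered sub-action sentences with 512-d features
score = stage_aligned_score(torch.randn(8, 512), torch.randn(8, 512))
```

The point of the per-stage split is that a frame from the end of a video is only compared with descriptions of late sub-actions, which is what "reside in the same temporal stage" refers to above.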
Related papers
- CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval [22.01591564940522]
We introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios. Our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, but also showcases robustness and scalability.
arXiv Detail & Related papers (2025-04-26T03:26:30Z)
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes text-driven image ordering with CLIP as a learning-to-rank task. Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
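As a rough illustration of the learning-to-rank reframing, an adapter and prompt tokens could be trained with a pairwise hinge loss over CLIP image-text similarities. The margin formulation below is an assumption for the sketch, not the paper's exact objective.

```python
# Hedged sketch: pairwise hinge loss over image-text similarities for images
# pre-sorted by ground-truth rank (index 0 should score highest).
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(sims: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    higher, lower = sims[:-1], sims[1:]              # adjacent rank pairs
    return F.relu(margin - (higher - lower)).mean()  # penalize inversions

# toy usage: the model currently ranks image 2 above image 1, so loss > 0
sims = torch.tensor([0.9, 0.7, 0.8, 0.2], requires_grad=True)
pairwise_ranking_loss(sims).backward()
```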
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
- Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition [34.88916568947695]
We propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition.
By introducing the proposed Task-Adapter into the last several layers of the backbone, we mitigate the overfitting problem caused by full fine-tuning.
Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets.
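A minimal sketch of that placement, assuming a standard residual bottleneck adapter and PyTorch forward hooks; the paper's task-specific adaptation is richer than this generic adapter, so treat the snippet as a scaffold, not the method itself.

```python
# Freeze the backbone blocks and train only small adapters hooked onto the
# outputs of the last few blocks (all design choices here are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, ReLU, up-project."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.relu(self.down(x)))

def insert_adapters(blocks: nn.ModuleList, last_n: int = 2, dim: int = 768):
    for blk in blocks:
        for p in blk.parameters():
            p.requires_grad = False               # backbone stays frozen
    adapters = nn.ModuleList()
    for blk in blocks[-last_n:]:
        adapter = Adapter(dim)
        # a forward hook that returns a value replaces the block's output
        blk.register_forward_hook(lambda mod, inp, out, a=adapter: a(out))
        adapters.append(adapter)
    return adapters                               # the only trainable part

# toy usage with linear layers standing in for transformer blocks
blocks = nn.ModuleList([nn.Linear(768, 768) for _ in range(4)])
adapters = insert_adapters(blocks)
x = torch.randn(1, 768)
for blk in blocks:
    x = blk(x)  # the last two block outputs pass through trainable adapters
```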
arXiv Detail & Related papers (2024-08-01T03:06:56Z)
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
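For intuition, visual prompting in this line of work typically means a learnable pixel-space perturbation shared across all inputs to a frozen model. The border-style prompt below is a common baseline form and an assumption here, not TVP's exact design.

```python
# Hedged sketch: a learnable border added to every image; only this shared
# perturbation is trained, the downstream model stays frozen.
import torch
import torch.nn as nn

class VisualPrompt(nn.Module):
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.zeros(1, image_size, image_size)
        mask[:, :pad, :] = 1   # top border
        mask[:, -pad:, :] = 1  # bottom border
        mask[:, :, :pad] = 1   # left border
        mask[:, :, -pad:] = 1  # right border
        self.register_buffer("mask", mask)  # fixed; only delta is learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.delta * self.mask   # same prompt for every input

prompted = VisualPrompt()(torch.randn(2, 3, 224, 224))
```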
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work. We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - Distribution Matching for Multi-Task Learning of Classification Tasks: a
Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful even for classification tasks with little or no annotation overlap.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
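One straightforward way to realize knowledge exchange via distribution matching is to penalize disagreement between task heads' soft predictions on shared inputs. The symmetric-KL form and temperature below are illustrative assumptions, not the paper's exact loss.

```python
# Hedged sketch: symmetric KL between two heads' tempered soft predictions
# (output dimensions are assumed to match for this toy version).
import torch
import torch.nn.functional as F

def distribution_matching_loss(logits_a: torch.Tensor,
                               logits_b: torch.Tensor,
                               tau: float = 2.0) -> torch.Tensor:
    log_pa = F.log_softmax(logits_a / tau, dim=-1)
    log_pb = F.log_softmax(logits_b / tau, dim=-1)
    kl_ab = F.kl_div(log_pa, log_pb, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_pb, log_pa, log_target=True, reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

loss = distribution_matching_loss(torch.randn(8, 5), torch.randn(8, 5))
```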
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [63.81641578763094]
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs).
We propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs textual adaptation by explicitly modeling dual-modality structural knowledge.
In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively.
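To make the dual-graph idea concrete, here is a hedged one-step sketch in which class-level text features are refined by propagating over similarity graphs built in each modality. The graph construction, single propagation step, and blend weight are all assumptions for illustration.

```python
# Hedged sketch: refine class text features with one propagation step over a
# textual and a visual class-similarity graph.
import torch
import torch.nn.functional as F

def graph_refine(text_feats: torch.Tensor,
                 vis_protos: torch.Tensor,
                 alpha: float = 0.7) -> torch.Tensor:
    """text_feats / vis_protos: (C, D) per-class features in each modality."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(vis_protos, dim=-1)
    adj_text = F.softmax(t @ t.t(), dim=-1)  # textual sub-graph edge weights
    adj_vis = F.softmax(v @ v.t(), dim=-1)   # visual sub-graph edge weights
    return alpha * (adj_text @ text_feats) + (1 - alpha) * (adj_vis @ text_feats)

refined = graph_refine(torch.randn(10, 512), torch.randn(10, 512))
```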
arXiv Detail & Related papers (2023-09-24T12:56:40Z)
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
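A sketch of how soft context sharing could look: per-task soft prompts are produced by a generator whose weights are shared across tasks, so related few-shot tasks exchange information through those shared weights. The tiny MLP generator is an assumption, not SoftCPT's actual architecture.

```python
# Hedged sketch: a shared network maps a task embedding to that task's soft
# prompt tokens; all design choices here are illustrative.
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    def __init__(self, num_tasks: int, ctx_len: int = 4, dim: int = 512):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, dim)  # one vector per task
        self.shared = nn.Sequential(                  # weights shared by tasks
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, ctx_len * dim))
        self.ctx_len, self.dim = ctx_len, dim

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        out = self.shared(self.task_emb(task_id))     # (B, ctx_len * dim)
        return out.view(-1, self.ctx_len, self.dim)   # (B, ctx_len, dim)

prompts = SharedPromptGenerator(num_tasks=3)(torch.tensor([0, 2]))
```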
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.