PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
- URL: http://arxiv.org/abs/2503.16945v1
- Date: Fri, 21 Mar 2025 08:45:50 GMT
- Title: PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
- Authors: Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham, Abdelmalik Taleb-Ahmed, Yassin El Hillali,
- Abstract summary: Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER)<n>We propose PE-CLIP, a parameter-efficient fine-tuning framework that adapts CLIP for DFER while significantly reducing trainable parameters.<n>By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER.
- Score: 7.966499123076283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER while significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP .
Related papers
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt- parsing module that bridges text understanding and layout generation.
MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones.
The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z) - DiffCLIP: Differential Attention Meets CLIP [57.396578974401734]
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures.<n>With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks.
arXiv Detail & Related papers (2025-03-09T14:04:09Z) - Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product [8.014705094248589]
Low- parameters prompt tuning method outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
Experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
arXiv Detail & Related papers (2025-02-16T05:50:12Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.<n>It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL)<n>It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (Fine CLIPER)
Our Fine CLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation [10.502680141980642]
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions.<n> vision-language foundation models, especially CLIP, have emerged as powerful tools for acquiring open-vocabulary capabilities.<n>H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.
arXiv Detail & Related papers (2024-05-29T07:41:34Z) - Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z) - Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z) - MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs the Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.