DiffCLIP: Differential Attention Meets CLIP
- URL: http://arxiv.org/abs/2503.06626v1
- Date: Sun, 09 Mar 2025 14:04:09 GMT
- Title: DiffCLIP: Differential Attention Meets CLIP
- Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem
- Abstract summary: We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks.
- Score: 57.396578974401734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
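To illustrate the mechanism the abstract refers to, below is a minimal, single-head PyTorch sketch of differential attention: two softmax attention maps are computed from split query/key projections, and their difference, weighted by a learnable scalar lambda, attends over the values so that shared noisy attention mass cancels out. This is a simplified sketch and an assumption on my part, not the DiffCLIP implementation; the actual model applies the mechanism inside CLIP's multi-head image and text encoder blocks and includes details (per-head normalization, the original lambda re-parameterization) omitted here, so refer to the linked repository for the authoritative code. The class name and the lambda_init default are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Minimal single-head sketch of differential attention (illustrative,
    not the DiffCLIP code): subtract a second softmax attention map from the
    first to cancel common noise before aggregating the values."""

    def __init__(self, dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections produce the two attention maps.
        self.q_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.k_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # Learnable scalar weighting how strongly the second map is subtracted.
        self.lambda_ = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. patch tokens or text tokens in CLIP.
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = 1.0 / math.sqrt(q1.size(-1))
        attn1 = F.softmax((q1 @ k1.transpose(-2, -1)) * scale, dim=-1)
        attn2 = F.softmax((q2 @ k2.transpose(-2, -1)) * scale, dim=-1)
        # Differential attention: the weighted difference of the two maps.
        out = (attn1 - self.lambda_ * attn2) @ v
        return self.out_proj(out)

# Usage sketch: a drop-in replacement for the attention sublayer in a
# CLIP transformer block (image or text encoder).
tokens = torch.randn(2, 16, 512)   # (batch, tokens, embedding dim)
attn = DifferentialAttention(dim=512)
print(attn(tokens).shape)          # torch.Size([2, 16, 512])
```

Under these assumptions the extra cost is essentially a doubled query/key projection plus one scalar per layer, which is consistent with the abstract's claim of minimal additional parameters and negligible overhead.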
Related papers
- Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.
Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.
Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition [7.966499123076283]
Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER).
We propose PE-CLIP, a parameter-efficient fine-tuning framework that adapts CLIP for DFER while significantly reducing trainable parameters.
By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER.
arXiv Detail & Related papers (2025-03-21T08:45:50Z) - CLIP-driven Dual Feature Enhancing Network for Gaze Estimation [26.00124975891083]
We propose a novel CLIP-driven Dual Feature Enhancing Network (CLIP-DFENet) to boost gaze estimation performance. A Language-driven Differential Module (LDM) is designed on the basis of the CLIP's text encoder to reveal the semantic difference of gaze. A Vision-driven Fusion Module (VFM) is introduced to strengthen the generalized and valuable components of visual embeddings obtained via CLIP's image encoder. A robust Double-head Gaze Regressor is adopted to map the enhanced features to gaze directions.
arXiv Detail & Related papers (2025-02-27T14:23:20Z) - Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z) - CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling [21.734200158914476]
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence.
Diversified Multiplet Upcycling (DMU) efficiently fine-tunes a series of CLIP models that capture different feature spaces.
Experiments demonstrate the strong performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks.
arXiv Detail & Related papers (2024-09-28T09:28:51Z) - ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z) - SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data [69.20254987896674]
SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
arXiv Detail & Related papers (2024-02-10T14:26:42Z) - VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z) - Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z) - CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)