Related papers: Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

URL: http://arxiv.org/abs/2409.14607v1
Date: Sun, 22 Sep 2024 22:04:26 GMT
Title: Patch Ranking: Efficient CLIP by Learning to Rank Local Patches
Authors: Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado
Abstract summary: Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP. We propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. We successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets.
Score: 11.225834286969283
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
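The abstract does not spell out how the greedy search is carried out, so the following is only a minimal sketch of one plausible variant: backward elimination, where the token whose removal hurts a scoring function the least is dropped at each step, and the removal order serves as the ranking. The names greedy_token_ranking, keep_top_tokens, and toy_score are hypothetical, and the toy additive score stands in for whatever image-text objective the authors actually use.

```python
from typing import Callable, List, Sequence


def greedy_token_ranking(
    num_tokens: int,
    score_fn: Callable[[Sequence[int]], float],
) -> List[int]:
    """Rank tokens from least to most important by greedy backward elimination.

    At each step, drop the token whose removal leaves the highest score on
    the remaining set. The returned list is the removal order.
    Hypothetical helper, not the paper's implementation.
    """
    remaining = list(range(num_tokens))
    removal_order: List[int] = []
    while len(remaining) > 1:
        best_tok, best_score = remaining[0], float("-inf")
        for tok in remaining:
            kept = [t for t in remaining if t != tok]
            score = score_fn(kept)
            if score > best_score:
                best_tok, best_score = tok, score
        remaining.remove(best_tok)
        removal_order.append(best_tok)
    removal_order.extend(remaining)  # the last survivor is the most important
    return removal_order


def keep_top_tokens(removal_order: Sequence[int], prune_ratio: float) -> List[int]:
    """Return the sorted indices of tokens kept after pruning the given fraction."""
    n_prune = int(len(removal_order) * prune_ratio)
    return sorted(removal_order[n_prune:])


# Toy stand-in for a model-based score: each token carries a fixed weight
# (an assumption for illustration only).
weights = [0.5, 0.1, 0.9, 0.3, 0.7]


def toy_score(kept: Sequence[int]) -> float:
    return sum(weights[t] for t in kept)


order = greedy_token_ranking(len(weights), toy_score)
kept = keep_top_tokens(order, prune_ratio=0.4)
print(order, kept)
```

Each elimination step evaluates the score once per remaining token, so the search costs a quadratic number of score evaluations in the token count. That cost is consistent with the abstract's motivation for training a lightweight predictor to approximate the ranking instead of running the search at inference time.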

Related papers

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models [94.49953824684853]
We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
arXiv Detail & Related papers (2025-08-03T02:15:43Z)
End-to-End Vision Tokenizer Tuning [73.3065542220568]
The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. The loss of the vision tokenization can be the representation bottleneck for target tasks. We propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks.
arXiv Detail & Related papers (2025-05-15T17:59:39Z)
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z)
CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance [14.849943391904882]
We propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples. Experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks.
arXiv Detail & Related papers (2024-12-05T04:58:28Z)
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
CLIP's Visual Embedding Projector is a Few-shot Cornucopia [45.93202559299953]
We introduce an alternative way for few-shot CLIP adaptation without adding "external" parameters to optimize. We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines. This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset transfer, domain generalization, and base-to-new class generalization.
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores the correlation evolvement between prompts and patch tokens during proficient training. Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes. Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. We investigate the differences in CLIP performance among various neural architectures. We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts. Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples. We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently. We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction. This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z)
Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models. Our method allows for effortless integration with existing models' training pipelines. On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost. Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.