Fast Certification of Vision-Language Models Using Incremental
Randomized Smoothing
- URL: http://arxiv.org/abs/2311.09024v2
- Date: Thu, 4 Jan 2024 09:54:46 GMT
- Title: Fast Certification of Vision-Language Models Using Incremental
Randomized Smoothing
- Authors: A K Nirala (1), A Joshi (2), C Hegde (2), S Sarkar (1) ((1) Iowa State
University, (2) New York University)
- Abstract summary: We introduce Open Vocabulary Certification (OVC), a fast certification method for open-vocabulary models like CLIP.
OVC relies on the observation that a classifier with a novel prompt can be viewed as a perturbed version of nearby classifiers in the base training set.
We demonstrate the effectiveness of OVC through experimental evaluation using multiple vision-language backbones on the CIFAR-10 and ImageNet test datasets.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A key benefit of deep vision-language models such as CLIP is that they enable
zero-shot open vocabulary classification; the user has the ability to define
novel class labels via natural language prompts at inference time. However,
while CLIP-based zero-shot classifiers have demonstrated competitive
performance across a range of domain shifts, they remain highly vulnerable to
adversarial attacks. Therefore, ensuring the robustness of such models is
crucial for their reliable deployment in the wild.
In this work, we introduce Open Vocabulary Certification (OVC), a fast
certification method designed for open-vocabulary models like CLIP via
randomized smoothing techniques. Given a base "training" set of prompts and
their corresponding certified CLIP classifiers, OVC relies on the observation
that a classifier with a novel prompt can be viewed as a perturbed version of
nearby classifiers in the base training set. Therefore, OVC can rapidly certify
the novel classifier using a variation of incremental randomized smoothing. By
using a caching trick, we achieve approximately two orders of magnitude
acceleration in the certification process for novel prompts. To achieve further
(heuristic) speedups, OVC approximates the embedding space at a given input
using a multivariate normal distribution, bypassing the need for sampling via
forward passes through the vision backbone. We demonstrate the effectiveness of
OVC through experimental evaluation using multiple vision-language backbones on
the CIFAR-10 and ImageNet test datasets.
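The caching trick and the Gaussian-approximation heuristic described above lend themselves to a compact illustration. The following is a minimal sketch, assuming a CLIP-style model with separate image and text encoders; the function names, the cache layout, and the simplified normal-approximation confidence bound are illustrative assumptions, not the paper's implementation (the standard randomized-smoothing certificate of Cohen et al., 2019, uses a Clopper-Pearson bound and a separate selection sample).
```python
# Sketch of the OVC caching idea under the assumptions stated above.
import numpy as np
from scipy.stats import norm

def cache_noisy_embeddings(encode_image, x, sigma, n_samples, rng):
    """Run the expensive vision backbone once per test input: embed n_samples
    noisy copies of x under N(0, sigma^2 I). The cache is then reused for every
    novel prompt, which is where the reported ~100x speedup would come from."""
    noisy = x[None, ...] + sigma * rng.standard_normal((n_samples,) + x.shape)
    embs = np.stack([encode_image(z) for z in noisy])         # (n_samples, d)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def smoothed_counts(cached_embs, text_embs):
    """Zero-shot classification of each cached noisy embedding against a novel
    prompt's text embeddings: only cheap dot products, no backbone passes."""
    preds = (cached_embs @ text_embs.T).argmax(axis=1)
    return np.bincount(preds, minlength=text_embs.shape[0])

def certify(cached_embs, text_embs, sigma, alpha=0.001):
    """Randomized-smoothing certificate R = sigma * Phi^{-1}(p_A), where p_A is
    a lower confidence bound on the top-class probability (Cohen et al., 2019).
    A crude normal-approximation bound stands in for Clopper-Pearson here."""
    counts = smoothed_counts(cached_embs, text_embs)
    n, top = counts.sum(), int(counts.argmax())
    p_hat = counts[top] / n
    p_lower = p_hat - norm.ppf(1 - alpha) * np.sqrt(p_hat * (1 - p_hat) / n)
    p_lower = min(p_lower, 1.0 - 1e-9)                        # keep Phi^{-1} finite
    if p_lower <= 0.5:
        return top, 0.0                                       # abstain: no certificate
    return top, sigma * norm.ppf(p_lower)

def gaussian_embedding_samples(cached_embs, n_draws, rng):
    """The further heuristic speedup from the abstract: fit a multivariate
    normal to the cached embeddings at this input, then draw fresh embedding
    samples directly, with no additional forward passes at all."""
    mu = cached_embs.mean(axis=0)
    cov = np.cov(cached_embs, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_draws)
```
Under this layout the vision backbone runs only while the cache is built; each novel prompt then costs one text-encoder call plus dot products, consistent with the roughly two-orders-of-magnitude acceleration reported above.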
Related papers
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector via simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z) - BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models [20.88680592729709]
We propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models.
BaFTA directly estimates class centroids using online clustering within a projected embedding space.
We demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-06-17T08:16:24Z) - Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP baseline.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z) - In-context Prompt Learning for Test-time Vision Recognition with Frozen
Vision-language Model [17.9086654601105]
Motivated by in-context learning in the field of natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for the test-time visual recognition task.
InCPL involves associating a new test sample with very few or even just one labeled example as its in-context prompt.
Our method has demonstrated superior performance and achieved state-of-the-art results across various downstream datasets.
arXiv Detail & Related papers (2024-03-10T08:15:51Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z) - Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z) - Revisiting Deep Local Descriptor for Improved Few-Shot Classification [56.74552164206737]
We show how one can improve the quality of embeddings by leveraging Dense Classification and Attentive Pooling.
We suggest to pool feature maps by applying attentive pooling instead of the widely used global average pooling (GAP) to prepare embeddings for few-shot classification.
arXiv Detail & Related papers (2021-03-30T00:48:28Z) - A comparison of self-supervised speech representations as input features
for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z) - Discriminative Nearest Neighbor Few-Shot Intent Detection by
Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention as a way to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)