Intriguing Differences Between Zero-Shot and Systematic Evaluations of
Vision-Language Transformer Models
- URL: http://arxiv.org/abs/2402.08473v1
- Date: Tue, 13 Feb 2024 14:07:49 GMT
- Title: Intriguing Differences Between Zero-Shot and Systematic Evaluations of
Vision-Language Transformer Models
- Authors: Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu, Lingjiong Zhu
- Abstract summary: Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets.
In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model.
Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely.
- Score: 7.360937524701675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have dominated natural language processing and other
areas in the last few years due to their superior (zero-shot) performance on
benchmark datasets. However, these models are poorly understood due to their
complexity and size. While probing-based methods are widely used to understand
specific properties, the structures of the representation space are not
systematically characterized; consequently, it is unclear how such models
generalize and overgeneralize to new inputs beyond datasets. In this paper,
based on a new gradient descent optimization method, we are able to explore the
embedding space of a commonly used vision-language model. Using the Imagenette
dataset, we show that while the model achieves over 99% zero-shot
classification performance, it fails systematic evaluations completely. Using a
linear approximation, we provide a framework to explain the striking
differences. We have also obtained similar results using a different model,
supporting that our findings apply to other transformer models with continuous
inputs. We also propose a robust way to detect the modified images.
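
The abstract's central contrast, over 99% zero-shot accuracy yet complete failure under systematic evaluation, can be made concrete with a short sketch. The code below assumes the "commonly used vision-language model" is a CLIP-style model served through Hugging Face transformers (checkpoint openai/clip-vit-base-patch32), which the abstract does not actually name; the prompt template, optimizer, step count, and learning rate are likewise illustrative choices and not the paper's own gradient descent method. Part (1) performs standard zero-shot classification over the ten Imagenette classes; part (2) runs a generic pixel-space gradient loop that pulls an image's embedding toward another class's text embedding, the kind of embedding-space exploration the paper describes.

```python
# Illustrative sketch only: the CLIP checkpoint, prompts, and the Adam loop
# are assumptions; this is not the paper's optimization method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The ten Imagenette classes, turned into zero-shot text prompts.
CLASSES = ["tench", "English springer", "cassette player", "chain saw",
           "church", "French horn", "garbage truck", "gas pump",
           "golf ball", "parachute"]
PROMPTS = [f"a photo of a {c}" for c in CLASSES]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():          # only the pixels are optimized, never the weights
    p.requires_grad_(False)


def zero_shot_predict(image: Image.Image) -> str:
    """(1) Standard zero-shot classification over the Imagenette prompts."""
    inputs = processor(text=PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, 10)
    return CLASSES[int(logits.argmax(dim=-1))]


def push_toward_class(image: Image.Image, target: str,
                      steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """(2) Generic gradient descent in pixel space toward the target class's text embedding."""
    text_in = processor(text=[f"a photo of a {target}"],
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        t_emb = model.get_text_features(**text_in)
        t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)

    pixels = processor(images=image, return_tensors="pt")["pixel_values"].clone()
    pixels.requires_grad_(True)
    opt = torch.optim.Adam([pixels], lr=lr)
    for _ in range(steps):
        i_emb = model.get_image_features(pixel_values=pixels)
        i_emb = i_emb / i_emb.norm(dim=-1, keepdim=True)
        loss = 1.0 - (i_emb * t_emb).sum()          # cosine distance to target
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pixels.detach()   # visually near-identical pixels, target-class embedding
```

To check the perturbed tensor, one would pass it directly as pixel_values together with the tokenized prompts (the processor's image preprocessing expects a PIL image). A near-imperceptible modification that flips the zero-shot prediction is the kind of failure the systematic evaluation exposes and the proposed detection method targets.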
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Image Similarity using An Ensemble of Context-Sensitive Models [2.9490616593440317]
We present a more intuitive approach to build and compare image similarity models based on labelled data.
We address the challenges of sparse sampling in the image space (R, A, B) and biases in the models trained with context-based data.
Our test results show that the constructed ensemble model performs 5% better than the best individual context-sensitive models.
arXiv Detail & Related papers (2024-01-15T20:23:05Z)
- COSE: A Consistency-Sensitivity Metric for Saliency on Image Classification [21.3855970055692]
We present a set of metrics that utilize vision priors to assess the performance of saliency methods on image classification tasks.
We show that although saliency methods are thought to be architecture-independent, most methods explain transformer-based models better than convolution-based models.
arXiv Detail & Related papers (2023-09-20T01:06:44Z)
- Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models [38.16654407693728]
We introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle.
Our method offers a new way to evaluate a model's robustness, free of the limitations of fixed benchmarks or constrained perturbations.
arXiv Detail & Related papers (2023-08-21T11:07:27Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate there is no single model that works best for all the cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
arXiv Detail & Related papers (2022-01-26T21:35:14Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- An application of a pseudo-parabolic modeling to texture image recognition [0.0]
We present a novel methodology for texture image recognition based on partial differential equation modeling.
We employ the pseudo-parabolic Buckley-Leverett equation to impose a dynamics on the digital image representation and collect local descriptors from the images as they evolve in time.
arXiv Detail & Related papers (2021-02-09T18:08:42Z)
- Distilling Interpretable Models into Human-Readable Code [71.11328360614479]
Human-readability is an important and desirable standard for machine-learned model interpretability.
We propose to train interpretable models using conventional methods, and then distill them into concise, human-readable code.
We describe a piecewise-linear curve-fitting algorithm that produces high-quality results efficiently and reliably across a broad range of use cases.
arXiv Detail & Related papers (2021-01-21T01:46:36Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.