ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
- URL: http://arxiv.org/abs/2311.09215v3
- Date: Tue, 23 Jul 2024 16:20:54 GMT
- Title: ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
- Authors: Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
- Abstract summary: In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy.
Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects.
This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis.
- Score: 27.75360812109922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.
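One of the behaviors the abstract compares is output calibration, i.e. whether a model's confidence matches its accuracy. As a minimal, self-contained sketch (not the paper's code), the standard Expected Calibration Error can be computed from per-sample confidences and correctness flags:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: mean |accuracy - confidence| over
    equal-width confidence bins, weighted by each bin's share of samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between empirical accuracy and mean confidence in this bin.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A perfectly calibrated toy model: 80% confidence, 80% accuracy.
conf = np.full(10, 0.8)
corr = np.array([1] * 8 + [0] * 2)
print(expected_calibration_error(conf, corr))  # 0.0
```

Comparing this value across e.g. a supervised ConvNet and a CLIP-trained ViT is the kind of "beyond accuracy" comparison the abstract describes.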
Related papers
- ImageNot: A contrast with ImageNet preserves model rankings [16.169858780154893]
We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects.
Key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet.
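The rank-preservation claim above can be checked mechanically by comparing model orderings on the two datasets. A minimal sketch, with made-up accuracy numbers purely for illustration:

```python
# Hypothetical top-1 accuracies for the same architectures on two datasets.
imagenet_acc = {"ResNet-50": 76.1, "ViT-B": 81.8, "ConvNeXt-T": 82.1}
imagenot_acc = {"ResNet-50": 61.0, "ViT-B": 68.2, "ConvNeXt-T": 69.0}

def ranking(scores):
    """Model names sorted from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

# Identical orderings mean the benchmark preserves model rankings.
print(ranking(imagenet_acc) == ranking(imagenot_acc))  # True
```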
arXiv Detail & Related papers (2024-04-02T17:13:04Z)
- ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object [78.58860252442045]
We introduce generative models as a data source for hard images that benchmark deep models' robustness.
We are able to generate images with more diversified backgrounds, textures, and materials than any prior work, where we term this benchmark as ImageNet-D.
Our work suggests that diffusion models can be an effective source to test vision models.
arXiv Detail & Related papers (2024-03-27T17:23:39Z)
- Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts [22.74552390076515]
We probe the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones and pretraining sets.
We detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-transformer models.
We find the existence of outlier features to be an indication of ImageNet shift robustness in models, since we only find them in robust models in our analysis.
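"Outlier features" of the kind mentioned above are typically dimensions whose activation magnitude dwarfs the rest. A minimal sketch of one common detection heuristic (the threshold `k` and the random features are assumptions here, not the paper's procedure):

```python
import numpy as np

def outlier_dims(features, k=6.0):
    """Flag feature dimensions whose mean absolute activation exceeds
    k times the median across dimensions."""
    mags = np.abs(features).mean(axis=0)          # (dim,) mean |activation|
    return np.where(mags > k * np.median(mags))[0]

rng = np.random.default_rng(0)
feats = rng.normal(size=(512, 16))                # 512 samples, 16 dims
feats[:, 3] *= 50.0                               # plant one outlier dimension
print(outlier_dims(feats))                        # [3]
```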
arXiv Detail & Related papers (2023-10-19T17:59:12Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing [45.14977000707886]
Higher accuracy on ImageNet usually leads to better robustness against different corruptions.
We create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions.
We evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers.
arXiv Detail & Related papers (2023-03-30T02:02:32Z)
- Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models [52.86163536826919]
We revisit adversarial training on ImageNet comparing ViTs and ConvNeXts.
Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust generalizations across different ranges of model parameters.
Our ViT + ConvStem yields the best generalization to unseen threat models.
arXiv Detail & Related papers (2023-03-03T11:53:01Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons behind their remaining failure cases and find that many are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
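One way a class hierarchy can help with ambiguous prompts is to back off from an uncertain fine-grained label to its parent class. A minimal sketch, with a toy hypernym map standing in for the real WordNet hierarchy (the map, threshold, and function are illustrative assumptions, not the paper's method):

```python
# Toy hypernym map standing in for the WordNet hierarchy.
hypernym = {
    "tabby": "cat", "siamese": "cat",
    "beagle": "dog",
    "cat": "animal", "dog": "animal",
}

def backoff_label(pred, confidence, threshold=0.5):
    """When the fine-grained prediction is uncertain, back off to its
    parent class in the hierarchy, where the label is less ambiguous."""
    if confidence >= threshold:
        return pred
    return hypernym.get(pred, pred)

print(backoff_label("tabby", 0.9))  # tabby
print(backoff_label("tabby", 0.3))  # cat
```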
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations [36.348968311668564]
We introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting.
We investigate 2,200 current recognition models and study their types of mistakes as a function of model architecture.
We find models have consistent failure modes across ImageNet-X categories.
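Studying mistakes as a function of factor annotations amounts to grouping per-image correctness by factor and comparing error rates. A minimal sketch with hypothetical records (the data and function names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical per-image records: (factor annotation, model was correct?)
records = [
    ("pose", False), ("pose", True),
    ("background", False), ("background", False),
    ("lighting", True), ("lighting", True),
]

def error_rate_by_factor(records):
    """Group factor-of-variation annotations and compute the model's
    error rate per factor."""
    totals, errors = defaultdict(int), defaultdict(int)
    for factor, correct in records:
        totals[factor] += 1
        errors[factor] += not correct
    return {f: errors[f] / totals[f] for f in totals}

print(error_rate_by_factor(records))
# {'pose': 0.5, 'background': 1.0, 'lighting': 0.0}
```

Consistent failure modes across models would show up as similar per-factor error profiles.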
arXiv Detail & Related papers (2022-11-03T14:56:32Z)
- Identical Image Retrieval using Deep Learning [0.0]
We use the BigTransfer (BiT) model, a state-of-the-art pretrained backbone.
We extract its key features and fit a k-nearest-neighbor model to retrieve the nearest neighbors.
Our model finds similar images, which are hard to retrieve through text queries, at low inference time.
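The feature-plus-nearest-neighbor pipeline can be sketched with plain cosine similarity; the random vectors below stand in for embeddings from a pretrained backbone such as BiT (this is an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def nearest_images(query_feat, index_feats, k=3):
    """Indices of the k most similar images by cosine similarity
    between feature vectors."""
    q = query_feat / np.linalg.norm(query_feat)
    x = index_feats / np.linalg.norm(index_feats, axis=1, keepdims=True)
    sims = x @ q                      # cosine similarity to every indexed image
    return np.argsort(-sims)[:k]      # best matches first

rng = np.random.default_rng(1)
index_feats = rng.normal(size=(100, 64))               # 100 indexed images
query = index_feats[42] + 0.01 * rng.normal(size=64)   # near-duplicate of image 42
print(nearest_images(query, index_feats)[0])           # 42
```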
arXiv Detail & Related papers (2022-05-10T13:34:41Z)
- Do Adversarially Robust ImageNet Models Transfer Better? [102.09335596483695]
Adversarially robust models often perform better than their standard-trained counterparts when used for transfer learning.
Our results are consistent with (and in fact, add to) recent hypotheses stating that robustness leads to improved feature representations.
arXiv Detail & Related papers (2020-07-16T17:42:40Z)
- Learning to Learn Parameterized Classification Networks for Scalable Input Images [76.44375136492827]
Convolutional Neural Networks (CNNs) do not behave predictably under changes in input resolution.
We employ meta learners to generate convolutional weights of main networks for various input scales.
We further utilize knowledge distillation on the fly over model predictions based on different input resolutions.
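A meta learner that emits convolution weights conditioned on the input scale is essentially a small hypernetwork. The sketch below is an illustrative stand-in under that assumption, not the paper's architecture: a one-hidden-layer network maps the scale to a flat weight vector, reshaped into a conv kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
KERNEL = 16 * 8 * 3 * 3  # 3x3 conv mapping 8 -> 16 channels, flattened

# Tiny "meta learner": scale in, convolution weights out.
W1, b1 = rng.normal(scale=0.1, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, KERNEL)), np.zeros(KERNEL)

def conv_weights_for_scale(scale):
    """Generate main-network conv weights for a given input scale
    (scale normalized by a nominal 224-pixel resolution)."""
    h = np.tanh(np.array([[scale / 224.0]]) @ W1 + b1)
    return (h @ W2 + b2).reshape(16, 8, 3, 3)

# Different input scales yield different generated weights.
w_small, w_large = conv_weights_for_scale(128), conv_weights_for_scale(320)
print(w_small.shape, np.allclose(w_small, w_large))  # (16, 8, 3, 3) False
```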
arXiv Detail & Related papers (2020-07-13T04:27:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.