Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets?
- URL: http://arxiv.org/abs/2501.15431v1
- Date: Sun, 26 Jan 2025 07:19:12 GMT
- Title: Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets?
- Authors: Utku Ozbulak, Esla Timothy Anzaku, Solha Kang, Wesley De Neve, Joris Vankerschaver
- Abstract summary: We investigate whether models that seem to perform well on ImageNet may experience significant performance declines on similar datasets.
Specifically, state-of-the-art frameworks such as DINO and SwAV, which are praised for their ImageNet performance, exhibit substantial drops on similar datasets.
We argue that otherwise good and desirable properties of models remain hidden when benchmarking is only performed on the ImageNet validation set.
- Score: 1.3821203559674384
- Abstract: Machine learning (ML) research relies heavily on benchmarks to determine the relative effectiveness of newly proposed models. Recently, several prominent research efforts argued that models which improve the state of the art by a small margin tend to do so by winning what they call a "benchmark lottery". An important benchmark in machine learning and computer vision is ImageNet, where newly proposed models are often showcased based on their performance on this dataset. Given the large number of self-supervised learning (SSL) frameworks that have been proposed in the past couple of years, each coming with marginal improvements on the ImageNet dataset, in this work we evaluate whether those marginal improvements on ImageNet translate to improvements on similar datasets. To do so, we investigate twelve popular SSL frameworks on five ImageNet variants and discover that models that seem to perform well on ImageNet may experience significant performance declines on similar datasets. Specifically, state-of-the-art frameworks such as DINO and SwAV, which are praised for their performance, exhibit substantial drops in performance, while MoCo and Barlow Twins display comparatively good results. As a result, we argue that otherwise good and desirable properties of models remain hidden when benchmarking is only performed on the ImageNet validation set, leading us to call for more adequate benchmarking. To avoid the "benchmark lottery" on ImageNet and to ensure a fair benchmarking process, we investigate the use of a unified metric that takes into account the performance of models on other ImageNet variant datasets.
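The abstract mentions a unified metric over ImageNet variants but does not spell out its exact form here. The snippet below is a minimal sketch, assuming the metric is the mean of min-max-normalized top-1 accuracies across the variant datasets; the framework names are taken from the abstract, while the accuracy values, dataset columns, and the `unified_score` helper are placeholders for illustration, not results or code from the paper.

```python
import numpy as np

# Hypothetical top-1 accuracies (%) of SSL frameworks on ImageNet and
# several of its variants; the numbers below are placeholders, not
# values reported in the paper.
accuracies = {
    #                ImageNet  -V2   -Real  -Sketch  -R
    "DINO":          [75.3,    62.1, 80.0,  25.0,    33.0],
    "SwAV":          [75.3,    61.8, 79.5,  24.0,    32.0],
    "MoCo":          [74.6,    62.5, 79.8,  27.0,    36.0],
    "Barlow Twins":  [73.2,    61.9, 79.1,  26.5,    35.5],
}

def unified_score(acc_table: dict[str, list[float]]) -> dict[str, float]:
    """Aggregate per-dataset accuracies into a single score per model.

    Each dataset column is min-max normalized across models so that no
    single dataset dominates, then the normalized scores are averaged.
    """
    models = list(acc_table)
    matrix = np.array([acc_table[m] for m in models], dtype=float)
    col_min, col_max = matrix.min(axis=0), matrix.max(axis=0)
    col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
    normalized = (matrix - col_min) / col_range
    return {m: float(normalized[i].mean()) for i, m in enumerate(models)}

# Rank models by the aggregated score instead of ImageNet accuracy alone.
ranking = sorted(unified_score(accuracies).items(), key=lambda kv: -kv[1])
for model, score in ranking:
    print(f"{model:<14} unified score = {score:.3f}")
```

Per-dataset min-max normalization is one way to keep a single easy or hard variant from dominating the aggregate; the paper's unified metric may use a different aggregation (for example, rank-based), which this sketch does not claim to reproduce.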
Related papers
- Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature? [1.4828022319975973]
We analyze the effectiveness of pre-trained state-of-the-art deep neural network (DNN) models on ImageNet and one of its variants, ImageNetV2.
Our findings show that the accuracy declines reported on ImageNetV2 are largely attributable to a characteristic of the dataset that has not received sufficient attention.
Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking.
arXiv Detail & Related papers (2024-12-24T12:55:31Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object [78.58860252442045]
We introduce generative models as a data source for hard images that benchmark deep models' robustness.
We are able to generate images with more diversified backgrounds, textures, and materials than any prior work; we term this benchmark ImageNet-D.
Our work suggests that diffusion models can be an effective source to test vision models.
arXiv Detail & Related papers (2024-03-27T17:23:39Z) - Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z) - ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing [45.14977000707886]
Higher accuracy on ImageNet usually leads to better robustness against different corruptions.
We create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions.
We evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers.
arXiv Detail & Related papers (2023-03-30T02:02:32Z) - Does progress on ImageNet transfer to real-world datasets? [28.918770106968843]
We evaluate ImageNet pre-trained models with varying accuracy on six practical image classification datasets.
On multiple datasets, models with higher ImageNet accuracy do not consistently yield performance improvements.
We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.
arXiv Detail & Related papers (2023-01-11T18:55:53Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset, recently proposed by Barbu et al., which contains objects in daily-life situations.
We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z) - From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)