Automated Classification of Model Errors on ImageNet
- URL: http://arxiv.org/abs/2401.02430v1
- Date: Mon, 13 Nov 2023 20:41:39 GMT
- Title: Automated Classification of Model Errors on ImageNet
- Authors: Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev
- Abstract summary: We propose an automated error classification framework to study how modeling choices affect error distributions.
We use our framework to comprehensively evaluate the error distribution of over 900 models.
In particular, we observe that the portion of severe errors drops significantly with increasing top-1 accuracy, indicating that, while top-1 accuracy underreports a model's true performance, it remains a valuable performance metric.
- Score: 7.455546102930913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the ImageNet dataset has been driving computer vision research over the
past decade, significant label noise and ambiguity have made top-1 accuracy an
insufficient measure of further progress. To address this, new label-sets and
evaluation protocols have been proposed for ImageNet, showing that
state-of-the-art models already achieve over 95% accuracy and shifting the
focus to investigating why the remaining errors persist.
Recent work in this direction employed a panel of experts to manually
categorize all remaining classification errors for two selected models.
However, this process is time-consuming, prone to inconsistencies, and requires
trained experts, making it unsuitable for regular model evaluation, thus
limiting its utility. To overcome these limitations, we propose the first
automated error classification framework, a valuable tool to study how modeling
choices affect error distributions. We use our framework to comprehensively
evaluate the error distribution of over 900 models. Perhaps surprisingly, we
find that across model architectures, scales, and pre-training corpora, top-1
accuracy is a strong predictor for the portion of all error types. In
particular, we observe that the portion of severe errors drops significantly
with increasing top-1 accuracy, indicating that, while top-1 accuracy
underreports a model's true performance, it remains a valuable performance metric.
We release all our code at
https://github.com/eth-sri/automated-error-analysis .
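As a rough, hypothetical sketch of what one step of such automated error classification could look like (the category names and the `same_superclass` helper are illustrative assumptions, not the taxonomy used by the paper or its repository), a top-1 mistake can be bucketed against a multi-label ground-truth set:

```python
# Illustrative sketch only: the categories and the same_superclass helper are
# assumptions for exposition, not the framework's actual taxonomy.
from typing import Callable, Set

def classify_error(pred: int,
                   valid_labels: Set[int],
                   same_superclass: Callable[[int, int], bool]) -> str:
    """Bucket a single top-1 prediction against a multi-label ground truth."""
    if pred in valid_labels:
        return "correct"                 # a valid multi-label, not a real mistake
    if any(same_superclass(pred, y) for y in valid_labels):
        return "fine-grained confusion"  # e.g. a sibling class in the hierarchy
    return "severe error"                # unrelated to every valid label
```

Aggregating such buckets over a test set would yield the kind of per-model error distribution that the paper correlates with top-1 accuracy.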
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel smooth regularization for fine-tuning that rectifies these structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- Intrinsic Self-Supervision for Data Quality Audits [35.69673085324971]
Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors.
In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem or a scoring problem.
We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases.
arXiv Detail & Related papers (2023-05-26T15:57:04Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of the predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
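As a hedged illustration of the general idea (a generic batch-level calibration penalty, not the paper's exact train-time loss), one can penalize the gap between mean predicted confidence and empirical accuracy within a mini-batch:

```python
# Generic calibration-style auxiliary penalty, shown only as an assumption for
# illustration; it is not the formulation proposed in the paper.
import torch

def calibration_aux_loss(confidences: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """confidences: predicted confidences in [0, 1]; correct: 0/1 prediction hits."""
    # Push the batch's mean confidence toward its empirical accuracy.
    return (confidences.mean() - correct.float().mean()).abs()
```

Such a term would typically be added to the detector's regular training loss with a small weight.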
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Crowd Density Estimation using Imperfect Labels [3.2575001434344286]
We propose a system that automatically generates imperfect labels using a deep learning model (called the annotator).
Our analysis on two crowd counting models and two benchmark datasets shows that the proposed scheme achieves accuracy closer to that of the model trained with perfect labels.
arXiv Detail & Related papers (2022-12-02T21:21:40Z)
- When does dough become a bagel? Analyzing the remaining mistakes on ImageNet [13.36146792987668]
We review and categorize every remaining mistake that a few top models make in order to provide insight into the long-tail of errors on one of the most benchmarked datasets in computer vision.
Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels.
To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example "major error" slice of the obvious mistakes made by today's top models.
arXiv Detail & Related papers (2022-05-09T23:25:45Z)
- Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification [86.32752788233913]
In classification problems, the Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance.
We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show uncertainty of the classes.
Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data.
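A minimal sketch of that estimator, assuming binary classification with soft labels c_i = P(y=1 | x_i) (variable names here are illustrative):

```python
# Sketch of a direct Bayes error estimate from soft binary labels: average the
# per-example class uncertainty min(c, 1 - c), as summarized above.
import numpy as np

def bayes_error_estimate(soft_labels) -> float:
    """soft_labels: iterable of P(y=1|x) values in [0, 1], one per example."""
    c = np.asarray(soft_labels, dtype=float)
    return float(np.mean(np.minimum(c, 1.0 - c)))

# Example: bayes_error_estimate([0.9, 0.95, 0.6]) == mean(0.1, 0.05, 0.4) ≈ 0.183
```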
arXiv Detail & Related papers (2022-02-01T13:22:26Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles [38.23896575179384]
We propose a principled and practically effective framework that simultaneously addresses both tasks: detecting errors and estimating accuracy on unlabeled data.
On iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7%.
arXiv Detail & Related papers (2021-06-29T21:32:51Z)
- Evaluating State-of-the-Art Classification Models Against Bayes Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z)
- Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy [11.265020351747916]
Defuse is a method to automatically discover and correct model errors beyond those available in test data.
We propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model.
Defuse corrects the error after fine-tuning while maintaining generalization on the test set.
arXiv Detail & Related papers (2021-02-11T18:08:42Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest and most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.