Are Labels Always Necessary for Classifier Accuracy Evaluation?
- URL: http://arxiv.org/abs/2007.02915v3
- Date: Tue, 25 May 2021 06:32:00 GMT
- Title: Are Labels Always Necessary for Classifier Accuracy Evaluation?
- Authors: Weijian Deng and Liang Zheng
- Abstract summary: We aim to estimate the classification accuracy on unlabeled test datasets.
We construct a meta-dataset comprised of datasets generated from the original images.
As the classification accuracy of the model on each sample (dataset) is known from the original dataset labels, our task can be solved via regression.
- Score: 28.110519483540482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To calculate the model accuracy on a computer vision task, e.g., object
recognition, we usually require a test set composed of test samples and their
ground truth labels. Whilst standard usage cases satisfy this requirement, many
real-world scenarios involve unlabeled test data, rendering common model
evaluation methods infeasible. We investigate this important and under-explored
problem, Automatic model Evaluation (AutoEval). Specifically, given a labeled
training set and a classifier, we aim to estimate the classification accuracy
on unlabeled test datasets. We construct a meta-dataset: a dataset comprised of
datasets generated from the original images via various transformations such as
rotation, background substitution, foreground scaling, etc. As the
classification accuracy of the model on each sample (dataset) is known from the
original dataset labels, our task can be solved via regression. Using the
feature statistics to represent the distribution of a sample dataset, we can
train regression models (e.g., a regression neural network) to predict model
performance. Using the synthetic meta-dataset and real-world datasets for training
and testing, respectively, we report a reasonable and promising prediction of
the model accuracy. We also provide insights into the application scope,
limitation, and potential future direction of AutoEval.
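A minimal sketch of the pipeline described in the abstract is given below, under simplifying assumptions: random vectors stand in for backbone features of real images, a Fréchet-style distance between feature distributions serves as the dataset statistic, and a linear fit replaces the regression network; all function and variable names are illustrative rather than taken from the paper's code.

```python
# AutoEval sketch (illustrative assumptions only; not the authors' implementation).
import numpy as np
from scipy.linalg import sqrtm

def frechet_statistic(features, train_mu, train_cov):
    """Represent a sample dataset by the Frechet distance between its feature
    distribution and the training-set feature distribution (Gaussian form)."""
    mu, cov = features.mean(axis=0), np.cov(features, rowvar=False)
    diff = mu - train_mu
    covmean = sqrtm(train_cov @ cov).real
    return float(diff @ diff + np.trace(train_cov + cov - 2.0 * covmean))

rng = np.random.default_rng(0)
dim = 64
train_feats = rng.normal(size=(2000, dim))            # placeholder training-set features
train_mu, train_cov = train_feats.mean(axis=0), np.cov(train_feats, rowvar=False)

# Meta-set: each "sample dataset" is a transformed copy of the original images
# (rotation, background substitution, foreground scaling, ...); its accuracy is
# known because labels carry over. A feature shift simulates transformation strength.
stats, accs = [], []
for shift in np.linspace(0.0, 2.0, 20):
    feats = train_feats + rng.normal(loc=shift, scale=0.3, size=(2000, dim))
    stats.append(frechet_statistic(feats, train_mu, train_cov))
    accs.append(0.90 - 0.25 * shift + rng.normal(scale=0.01))   # simulated known accuracy

# Regress accuracy on the dataset statistic (a regression network could consume
# richer statistics such as feature means and covariances).
X = np.column_stack([np.ones(len(stats)), stats])
w, *_ = np.linalg.lstsq(X, np.array(accs), rcond=None)

# Estimate accuracy on an unlabeled test set from its features alone.
test_feats = train_feats + rng.normal(loc=1.2, scale=0.3, size=(2000, dim))
print("predicted accuracy:", w[0] + w[1] * frechet_statistic(test_feats, train_mu, train_cov))
```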
Related papers
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection diminishes.
arXiv Detail & Related papers (2024-07-02T09:54:39Z) - Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that aggregates all noisy labels from all labeling functions (LFs) to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model.
arXiv Detail & Related papers (2022-07-27T14:36:35Z) - Certifying Data-Bias Robustness in Linear Regression [12.00314910031517]
We present a technique for certifying whether linear regression models are pointwise-robust to label bias in a training dataset.
We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method.
We also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets.
arXiv Detail & Related papers (2022-06-07T20:47:07Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (see the sketch after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Label-Free Model Evaluation with Semi-Structured Dataset Representations [78.54590197704088]
Label-free model evaluation, or AutoEval, estimates model accuracy on unlabeled test sets.
In the absence of image labels, we estimate model performance for AutoEval via regression on dataset representations.
We propose a new semi-structured dataset representation that is manageable for regression learning while containing rich information for AutoEval.
arXiv Detail & Related papers (2021-12-01T18:15:58Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To combine the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Evaluating State-of-the-Art Classification Models Against Bayes Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z) - Symbolic Regression Driven by Training Data and Prior Knowledge [0.0]
In symbolic regression, the search for analytic models is driven purely by the prediction error observed on the training data samples.
We propose a multi-objective symbolic regression approach that is driven by both the training data and the prior knowledge of the properties the desired model should manifest.
arXiv Detail & Related papers (2020-04-24T19:15:06Z)
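The Average Thresholded Confidence (ATC) entry above admits a compact illustration. The sketch below is a hedged reading of that one-line summary, not the authors' released code: it assumes max-softmax confidences are available for a labeled source validation split (`val_conf`, `val_correct`) and for the unlabeled target set (`target_conf`), and it chooses the threshold so that the fraction of source examples below it matches the source error rate; all names are hypothetical.

```python
# ATC sketch (illustrative assumptions only; not the authors' implementation).
import numpy as np

def atc_predict_accuracy(val_conf, val_correct, target_conf):
    # Pick the threshold so the fraction of source examples whose confidence
    # falls below it matches the observed source error rate.
    source_error = 1.0 - val_correct.mean()
    threshold = np.quantile(val_conf, source_error)
    # Predicted target accuracy: fraction of target confidences above the threshold.
    return float((target_conf >= threshold).mean())

rng = np.random.default_rng(0)
val_conf = rng.beta(5, 2, size=5000)            # placeholder source confidences
val_correct = rng.random(5000) < val_conf       # correctness correlated with confidence
target_conf = rng.beta(4, 3, size=5000)         # shifted target confidence distribution
print("estimated target accuracy:", atc_predict_accuracy(val_conf, val_correct, target_conf))
```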
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.