Evaluation of human-model prediction difference on the Internet Scale of Data
- URL: http://arxiv.org/abs/2312.03291v2
- Date: Sun, 10 Nov 2024 06:12:24 GMT
- Title: Evaluation of human-model prediction difference on the Internet Scale of Data
- Authors: Weitang Liu, Ying Wai Li, Yuelei Li, Zihan Wang, Yi-Zhuang You, Jingbo Shang
- Abstract summary: Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs.
We propose OmniInput, a novel approach to evaluate and compare NNs by the precision and recall (PR) of an input space.
- Score: 32.7296837724399
- License:
- Abstract: Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs. It would be beneficial if we could evaluate the difference between human annotation and model prediction for an internet-scale number of inputs, or more generally, for an input space whose enumeration is computationally impractical. Traditional model evaluation methods rely on precision and recall (PR) as metrics, which are typically estimated by comparing human annotations with model predictions on a specific dataset. This is feasible because enumerating thousands of test inputs is manageable. However, estimating PR across a large input space is challenging because enumeration becomes computationally infeasible. We propose OmniInput, a novel approach to evaluate and compare NNs by the PR of an input space. OmniInput differs from previous work in that its estimated PR reflects the difference between human annotation and model prediction over an input space that is usually too large to enumerate. We empirically validate our method within an enumerable input space, and our experiments demonstrate that OmniInput can effectively estimate and compare precision and recall for (large) language models within a broad input space that is not enumerable.
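To make the quantity being estimated concrete, here is a minimal toy sketch, in an enumerable input space, of precision and recall computed against a stand-in human-annotation oracle: once by exhaustive enumeration and once by plain Monte Carlo sampling of inputs. The oracle, the toy model, and the uniform sampler are assumptions for illustration only; this is not the OmniInput estimator, whose contribution is precisely to make such estimates tractable when the space cannot be enumerated.

```python
import itertools
import random

# Toy enumerable input space: all binary strings of length 12.
LENGTH = 12

def human_label(x):
    """Stand-in for human annotation: input is 'positive' if it has >= 8 ones."""
    return sum(x) >= 8

def model_predict(x):
    """Stand-in model: an imperfect rule that only looks at the first 10 bits."""
    return sum(x[:10]) >= 7

def pr_by_enumeration():
    """Exact precision/recall by enumerating the whole input space."""
    tp = fp = fn = 0
    for x in itertools.product([0, 1], repeat=LENGTH):
        y, yhat = human_label(x), model_predict(x)
        tp += y and yhat
        fp += (not y) and yhat
        fn += y and (not yhat)
    return tp / (tp + fp), tp / (tp + fn)

def pr_by_sampling(n_samples=20000, seed=0):
    """Monte Carlo estimate: the only feasible route when enumeration is impractical."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(n_samples):
        x = tuple(rng.randint(0, 1) for _ in range(LENGTH))
        y, yhat = human_label(x), model_predict(x)
        tp += y and yhat
        fp += (not y) and yhat
        fn += y and (not yhat)
    return tp / (tp + fp), tp / (tp + fn)

if __name__ == "__main__":
    print("enumerated P/R:", pr_by_enumeration())
    print("sampled    P/R:", pr_by_sampling())
```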
Related papers
- Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling [20.078602767179355]
Failure to properly account for errors in machine learning predictions renders standard statistical procedures invalid.
We introduce bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed.
We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
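For orientation, the sketch below shows the textbook prediction-powered estimate of a mean with a plain percentile bootstrap: model predictions on a large unlabeled sample are debiased by the labeled residuals. It does not implement the paper's nonuniform-sampling or imputed-covariate machinery; the toy data and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_powered_mean(y_lab, f_lab, f_unlab):
    """Classic prediction-powered point estimate: mean of predictions on unlabeled
    data, corrected by the average prediction error measured on labeled data."""
    return f_unlab.mean() + (y_lab - f_lab).mean()

def bootstrap_ci(y_lab, f_lab, f_unlab, n_boot=2000, alpha=0.05):
    """Percentile bootstrap over both the labeled and unlabeled samples."""
    stats = []
    n, m = len(y_lab), len(f_unlab)
    for _ in range(n_boot):
        i = rng.integers(0, n, n)   # resample labeled pairs
        j = rng.integers(0, m, m)   # resample unlabeled predictions
        stats.append(prediction_powered_mean(y_lab[i], f_lab[i], f_unlab[j]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy data: a biased "ML model" f(x) = true value + systematic offset + noise.
truth_unlab = rng.normal(1.0, 1.0, size=10_000)
truth_lab = rng.normal(1.0, 1.0, size=200)
f_unlab = truth_unlab + 0.3 + rng.normal(0, 0.5, size=truth_unlab.size)
f_lab = truth_lab + 0.3 + rng.normal(0, 0.5, size=truth_lab.size)

print("point estimate:", prediction_powered_mean(truth_lab, f_lab, f_unlab))
print("95% CI:", bootstrap_ci(truth_lab, f_lab, f_unlab))
```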
arXiv Detail & Related papers (2025-01-30T18:46:43Z)
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop a suite of selectors to get the most informative datapoints for human evaluation.
We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection.
In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
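A minimal variance-based selector might look like the sketch below: given automated metric scores from several systems per item, rank items by disagreement and send the top-k to human annotators. The array shapes and function names are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def select_for_human_eval(metric_scores, k=10):
    """metric_scores: array of shape (n_items, n_systems) with automated metric
    scores. Items where systems disagree most are assumed to be the most
    informative ones to send to human annotators."""
    variance = metric_scores.var(axis=1)
    return np.argsort(variance)[::-1][:k]

# Toy example: 100 items scored by 4 systems.
rng = np.random.default_rng(1)
scores = rng.uniform(0, 1, size=(100, 4))
print("items to annotate first:", select_for_human_eval(scores, k=5))
```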
arXiv Detail & Related papers (2025-01-30T10:33:26Z)
- Model-diff: A Tool for Comparative Study of Language Models in the Input Space [34.680890752084004]
We propose a new model comparative analysis setting that considers a large input space where brute-force enumeration would be infeasible.
Experiments reveal for the first time the quantitative prediction differences between LMs in a large input space, potentially facilitating the model analysis for applications such as model plagiarism.
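In spirit, the comparison reduces to estimating how often two models disagree over sampled inputs when the space cannot be enumerated. The toy sketch below does this with uniform sampling and two stand-in "language models" over a tiny vocabulary; the real method has to sample far more carefully, and every name here is an assumption.

```python
import random

VOCAB = ["a", "b", "c", "d"]

def lm_a(tokens):
    """Stand-in model A: predicts the most frequent token in the input."""
    return max(VOCAB, key=tokens.count)

def lm_b(tokens):
    """Stand-in model B: predicts the last token of the input."""
    return tokens[-1]

def estimated_disagreement(n_samples=50_000, length=8, seed=0):
    """Fraction of sampled inputs on which the two models differ."""
    rng = random.Random(seed)
    diff = 0
    for _ in range(n_samples):
        x = [rng.choice(VOCAB) for _ in range(length)]
        diff += lm_a(x) != lm_b(x)
    return diff / n_samples

print("estimated prediction difference:", estimated_disagreement())
```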
arXiv Detail & Related papers (2024-12-13T00:06:25Z)
- Sparse Prototype Network for Explainable Pedestrian Behavior Prediction [60.80524827122901]
We present Sparse Prototype Network (SPN), an explainable method designed to simultaneously predict a pedestrian's future action, trajectory, and pose.
Regularized by mono-semanticity and clustering constraints, the prototypes learn consistent and human-understandable features.
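The prototype idea itself can be sketched in a few lines: an encoded feature is compared against a small set of prototype vectors, and the prediction head operates on those similarity scores, so a decision can be traced back to the prototypes it activated. The shapes, the cosine similarity, and the single action head below are assumptions for illustration, not the SPN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D, P, A = 64, 8, 3                     # feature dim, prototypes, action classes
prototypes = rng.normal(size=(P, D))   # learned prototype vectors (random here)
head = rng.normal(size=(A, P))         # linear head over prototype activations

def prototype_forward(feature):
    """Similarity of the input feature to each prototype, then an action score.
    The similarity vector is the human-inspectable part of the prediction."""
    sims = prototypes @ feature / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(feature) + 1e-8)
    return sims, head @ sims

feature = rng.normal(size=D)           # stand-in for an encoded pedestrian observation
sims, scores = prototype_forward(feature)
print("prototype activations:", np.round(sims, 2))
print("predicted action:", int(scores.argmax()))
```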
arXiv Detail & Related papers (2024-10-16T03:33:40Z)
- Knockout: A simple way to handle missing inputs [8.05324050767023]
Models that leverage rich inputs can be difficult to deploy widely because some inputs may be missing at inference.
Current popular solutions to this problem include marginalization, imputation, and training multiple models.
We propose an efficient way to learn both the conditional distribution using full inputs and the marginal distributions.
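One common recipe in this space, sketched below under stated assumptions, is to randomly replace input features with a fixed placeholder during training so a single model learns to predict from full inputs as well as from subsets, with the same placeholder standing in for genuinely missing features at inference. The placeholder value, drop rate, and function names are illustrative, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
PLACEHOLDER = 0.0   # sentinel value standing in for "this input is missing"

def knockout_augment(batch, drop_prob=0.3):
    """Training-time augmentation: each feature is independently replaced by the
    placeholder with probability drop_prob, so the model sees the missingness
    patterns it may later encounter at inference."""
    mask = rng.random(batch.shape) < drop_prob
    return np.where(mask, PLACEHOLDER, batch)

def prepare_inference_input(x, observed):
    """At inference, genuinely missing features get the same placeholder."""
    return np.where(observed, x, PLACEHOLDER)

batch = rng.normal(size=(4, 6))
print(knockout_augment(batch))
print(prepare_inference_input(batch[0], observed=np.array([1, 1, 0, 1, 0, 1], bool)))
```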
arXiv Detail & Related papers (2024-05-30T19:47:34Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive as the number of demonstrations increases, even as its accuracy on the test data improves.
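The test is easy to state in code: swap one part of an input, for example the premise of an NLI pair, with the corresponding part from another example and check whether the model's prediction moves. The sketch below assumes a generic predict(premise, hypothesis) callable and is an illustration of the idea, not the paper's evaluation harness.

```python
def counterfactual_attentiveness(predict, examples):
    """examples: list of (premise, hypothesis) pairs.
    For each example, replace the premise with a premise taken from a different
    example; an attentive model should change its prediction for most swaps."""
    changed = 0
    n = len(examples)
    for i, (premise, hypothesis) in enumerate(examples):
        other_premise = examples[(i + 1) % n][0]   # counterpart from another example
        if predict(premise, hypothesis) != predict(other_premise, hypothesis):
            changed += 1
    return changed / n

# Toy model that (inattentively) looks only at the hypothesis.
inattentive = lambda premise, hypothesis: len(hypothesis) % 3
examples = [("a cat sleeps", "an animal rests"),
            ("it rains", "the ground is wet"),
            ("he runs fast", "he is slow")]
print("fraction of predictions that changed:",
      counterfactual_attentiveness(inattentive, examples))
```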
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- Efficient Shapley Values Estimation by Amortization for Text Classification [66.7725354593271]
We develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations.
Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately with up to 60 times speedup.
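For context, the sketch below shows the expensive baseline that amortization avoids: a standard permutation-sampling Shapley estimator that needs one extra model call per feature per sample, whereas an amortized predictor maps the input to all attributions in a single forward pass. The toy classifier and all names are assumptions.

```python
import random

def monte_carlo_shapley(f, x, baseline, n_samples=200, seed=0):
    """Estimate each feature's Shapley value for model f at input x by sampling
    random feature orderings; this is the per-input cost that an amortized
    predictor replaces with one forward pass."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_samples):
        order = list(range(d))
        rng.shuffle(order)
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]
            new = f(current)
            phi[i] += (new - prev) / n_samples
            prev = new
    return phi

# Toy "text classifier": score is a weighted sum of token-presence features.
weights = [0.5, -1.0, 2.0, 0.0]
f = lambda feats: sum(w * v for w, v in zip(weights, feats))
print(monte_carlo_shapley(f, x=[1, 1, 1, 1], baseline=[0, 0, 0, 0]))
```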
arXiv Detail & Related papers (2023-05-31T16:19:13Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
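A bare-bones combination of the two ingredients is sketched below: the model abstains when its confidence is low, and the target-domain examples it is least confident about are the ones queried for labels. The threshold, the confidence measure, and the query budget are illustrative assumptions.

```python
import numpy as np

def selective_predict(probs, threshold=0.8):
    """Return the predicted class per example, or -1 (abstain) when the model's
    top probability falls below the threshold."""
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1
    return preds

def query_for_labels(probs, budget=5):
    """Active-learning step: ask humans to label the target-domain examples the
    model is least confident about."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:budget]

rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print("predictions (-1 = abstain):", selective_predict(probs))
print("examples to query:", query_for_labels(probs))
```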
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- PAMI: partition input and aggregate outputs for model interpretation [69.42924964776766]
In this study, a simple yet effective visualization framework called PAMI is proposed, based on the observation that deep learning models often aggregate features from local regions when making predictions.
The basic idea is to mask the majority of the input and use the corresponding model output as the relative contribution of the preserved input part to the original model prediction.
Extensive experiments on multiple tasks confirm that the proposed method performs better than existing visualization approaches at precisely finding class-specific input regions.
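The masking idea translates directly into code: preserve one patch of the input at a time, feed it through the model, and record the output as that patch's relative contribution, then assemble the per-patch scores into a heatmap. The patch size, the tiny stand-in model, and the aggregation below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(image):
    """Stand-in classifier: a single class score that happens to depend mostly
    on the brightness of the top-left corner."""
    return float(image[:8, :8].sum() + 0.01 * image.sum())

def pami_heatmap(model, image, patch=8):
    """Mask everything except one patch at a time and record the model output as
    the relative contribution of the preserved region."""
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = np.zeros_like(image)
            masked[i:i + patch, j:j + patch] = image[i:i + patch, j:j + patch]
            heat[i // patch, j // patch] = model(masked)
    return heat

image = rng.random((32, 32))
print(np.round(pami_heatmap(toy_model, image), 2))
```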
arXiv Detail & Related papers (2023-02-07T08:48:34Z)
- Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
- Detecting unusual input to neural networks [0.48733623015338234]
We study a method that judges the unusualness of an input by evaluating its informative content compared to the learned parameters.
This technique can be used to judge whether a network is suitable for processing a certain input and to raise a red flag that unexpected behavior might lie ahead.
arXiv Detail & Related papers (2020-06-15T10:48:43Z)