Evaluating General-Purpose AI with Psychometrics
- URL: http://arxiv.org/abs/2310.16379v2
- Date: Fri, 29 Dec 2023 05:42:07 GMT
- Title: Evaluating General-Purpose AI with Psychometrics
- Authors: Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie
- Abstract summary: We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
- Score: 43.85432514910491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Comprehensive and accurate evaluation of general-purpose AI systems such as
large language models allows for effective mitigation of their risks and
deepened understanding of their capabilities. Current evaluation methodology,
mostly based on benchmarks of specific tasks, falls short of adequately
assessing these versatile AI systems, as present techniques lack a scientific
foundation for predicting their performance on unforeseen tasks and explaining
their varying performance on specific task items or user inputs. Moreover,
existing benchmarks of specific tasks raise growing concerns about their
reliability and validity. To tackle these challenges, we suggest transitioning
from task-oriented evaluation to construct-oriented evaluation. Psychometrics,
the science of psychological measurement, provides a rigorous methodology for
identifying and measuring the latent constructs that underlie performance
across multiple tasks. We discuss its merits, warn against potential pitfalls,
and propose a framework to put it into practice. Finally, we explore future
opportunities of integrating psychometrics with the evaluation of
general-purpose AI systems.
Related papers
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Developing and Evaluating a Design Method for Positive Artificial Intelligence [0.6138671548064356]
Development of "AI for good" poses challenges around aligning systems with complex human values.
This article presents and evaluates the Positive AI design method aimed at addressing this gap.
The method provides a human-centered process to translate wellbeing aspirations into concrete practices.
arXiv Detail & Related papers (2024-02-02T15:31:08Z)
- Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z)
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [59.77710485234197]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z)
- Injecting Planning-Awareness into Prediction and Detection Evaluation [42.228191984697006]
We take a step back and critically assess current evaluation metrics, proposing task-aware metrics as a better measure of performance in systems where they are deployed.
Experiments on an illustrative simulation as well as real-world autonomous driving data validate that our proposed task-aware metrics are able to account for outcome asymmetry and provide a better estimate of a model's closed-loop performance.
arXiv Detail & Related papers (2021-10-07T08:52:48Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- On the uncertainty of self-supervised monocular depth estimation [52.13311094743952]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all.
We explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy.
We propose a novel peculiar technique specifically designed for self-supervised approaches.
arXiv Detail & Related papers (2020-05-13T09:00:55Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
- What's a Good Prediction? Challenges in evaluating an agent's knowledge [0.9281671380673306]
We show the conflict between accuracy and usefulness of general knowledge.
We propose an alternate evaluation approach that arises continually in the online continual learning setting.
This paper contributes a first look into evaluation of predictions through their use.
arXiv Detail & Related papers (2020-01-23T21:44:43Z)