Evaluating General-Purpose AI with Psychometrics
- URL: http://arxiv.org/abs/2310.16379v2
- Date: Fri, 29 Dec 2023 05:42:07 GMT
- Title: Evaluating General-Purpose AI with Psychometrics
- Authors: Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell,
Luning Sun, Fang Luo, Xing Xie
- Abstract summary: We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
- Score: 43.85432514910491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Comprehensive and accurate evaluation of general-purpose AI systems such as
large language models allows for effective mitigation of their risks and
deepened understanding of their capabilities. Current evaluation methodology,
mostly based on benchmarks of specific tasks, falls short of adequately
assessing these versatile AI systems, as present techniques lack a scientific
foundation for predicting their performance on unforeseen tasks and explaining
their varying performance on specific task items or user inputs. Moreover,
existing benchmarks of specific tasks raise growing concerns about their
reliability and validity. To tackle these challenges, we suggest transitioning
from task-oriented evaluation to construct-oriented evaluation. Psychometrics,
the science of psychological measurement, provides a rigorous methodology for
identifying and measuring the latent constructs that underlie performance
across multiple tasks. We discuss its merits, warn against potential pitfalls,
and propose a framework to put it into practice. Finally, we explore future
opportunities of integrating psychometrics with the evaluation of
general-purpose AI systems.
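To make the construct-oriented framing concrete, the minimal sketch below fits a one-parameter logistic (Rasch) item response model, a standard psychometric tool, to a binary matrix of task outcomes, jointly estimating a latent ability for each AI system and a difficulty for each test item. The data, model choice, and variable names are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch: fitting a Rasch (1-parameter logistic) IRT model to a
# binary response matrix of AI systems x test items. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response matrix: responses[i, j] = 1 if system i solved item j.
n_systems, n_items = 8, 40
true_ability = rng.normal(0.0, 1.0, n_systems)
true_difficulty = rng.normal(0.0, 1.0, n_items)
p_true = 1.0 / (1.0 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
responses = rng.binomial(1, p_true)

# Joint maximum-likelihood estimation of the Rasch model by gradient ascent:
# P(correct) = sigmoid(ability_i - difficulty_j).
ability = np.zeros(n_systems)
difficulty = np.zeros(n_items)
lr = 0.05
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    resid = responses - p                    # gradient of the Bernoulli log-likelihood
    ability += lr * resid.sum(axis=1)
    difficulty -= lr * resid.sum(axis=0)
    difficulty -= difficulty.mean()          # pin the scale (model is shift-invariant)

print("estimated latent abilities:", np.round(ability, 2))
print("estimated item difficulties (first 5):", np.round(difficulty[:5], 2))
```
In a construct-oriented evaluation, estimated abilities of this kind would play the role of measured latent constructs expected to generalize across tasks that tap the same construct.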
Related papers
- Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z) - Pessimistic Evaluation [58.736490198613154]
We argue that current evaluation of information access systems assumes utilitarian values that are not aligned with traditions of information access based on equal access.
We advocate for pessimistic evaluation of information access systems, focusing on worst-case utility.
arXiv Detail & Related papers (2024-10-17T15:40:09Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - Evaluating Human-AI Collaboration: A Review and Methodological Framework [4.41358655687435]
The use of artificial intelligence (AI) alongside individuals in working environments, known as Human-AI Collaboration (HAIC), has become essential.
However, evaluating HAIC's effectiveness remains challenging due to the complex interaction of the components involved.
This paper provides a detailed analysis of existing HAIC evaluation approaches and develops a fresh paradigm for more effectively evaluating these systems.
arXiv Detail & Related papers (2024-07-09T12:52:22Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Injecting Planning-Awareness into Prediction and Detection Evaluation [42.228191984697006]
- Injecting Planning-Awareness into Prediction and Detection Evaluation [42.228191984697006]
We take a step back and critically assess current evaluation metrics, proposing task-aware metrics as a better measure of performance in systems where they are deployed.
Experiments on an illustrative simulation as well as real-world autonomous driving data validate that our proposed task-aware metrics are able to account for outcome asymmetry and provide a better estimate of a model's closed-loop performance.
arXiv Detail & Related papers (2021-10-07T08:52:48Z) - On the uncertainty of self-supervised monocular depth estimation [52.13311094743952]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all.
We explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy.
We propose a novel technique specifically designed for self-supervised approaches.
arXiv Detail & Related papers (2020-05-13T09:00:55Z) - What's a Good Prediction? Challenges in evaluating an agent's knowledge [0.9281671380673306]
We show the conflict between the accuracy and the usefulness of general knowledge.
We propose an alternate evaluation approach that arises naturally in the online continual learning setting.
This paper contributes a first look into the evaluation of predictions through their use.
arXiv Detail & Related papers (2020-01-23T21:44:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.