Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing
Perspective
- URL: http://arxiv.org/abs/2306.10512v2
- Date: Sat, 28 Oct 2023 13:02:24 GMT
- Title: Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing
Perspective
- Authors: Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang,
Guanhao Zhao, Zheng Zhang, Qingyang Mao, Shijin Wang, Enhong Chen
- Abstract summary: Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
- Score: 63.92197404447808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), like ChatGPT, have shown some human-like
cognitive abilities. To compare these abilities across models, several
benchmarks (i.e., sets of standard test questions) from different fields (e.g.,
Literature, Biology, and Psychology) are often adopted, and the test results
under traditional metrics such as accuracy, recall, and F1 are reported.
However, this way of evaluating LLMs can be inefficient and inaccurate from
a cognitive science perspective. Inspired by Computerized Adaptive Testing
(CAT) used in psychometrics, we propose an adaptive testing framework for LLM
evaluation. Rather than using a standard test set and simply reporting
accuracy, this approach dynamically adjusts the characteristics of the test
questions, such as difficulty, based on the model's performance. This allows
for a more accurate estimation of the model's abilities, using fewer questions.
More importantly, it allows LLMs to be compared with humans easily, which is
essential for NLP models that aim for human-level ability. Our diagnostic
reports find that ChatGPT often behaves like a "careless student", prone to
slips and occasional guessing. We conduct a fine-grained diagnosis and rank
six recent instruction-tuned LLMs along three aspects: Subject Knowledge,
Mathematical Reasoning, and Programming, where GPT-4 significantly outperforms
the other models and reaches the cognitive ability of middle-level students.
Different tests for different models via efficient adaptive testing: we
believe this has the potential to become a new norm in evaluating large
language models.
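The abstract does not include an implementation, but the CAT loop it describes is easy to sketch. Below is a minimal, hypothetical Python sketch of such a loop under a four-parameter IRT item model, whose guessing floor (c) and slip ceiling (d) correspond to the "guessing" and "slip" behaviors mentioned above. The item parameters, the grid-search ability estimator, the difficulty-matching selection rule (real CAT systems typically maximize Fisher information instead), and the ask_llm callback are all illustrative assumptions, not the paper's code.

```python
import math
import random

def p_correct(theta, item):
    # Four-parameter IRT: discrimination a, difficulty b,
    # guessing floor c, slip ceiling d (upper asymptote 1 - slip).
    a, b, c, d = item["a"], item["b"], item["c"], item["d"]
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses):
    # Maximum-likelihood ability estimate over a coarse grid of theta values.
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(theta):
        ll = 0.0
        for item, correct in responses:
            p = min(max(p_correct(theta, item), 1e-6), 1.0 - 1e-6)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=loglik)

def adaptive_test(ask_llm, item_bank, n_questions=20):
    # Core CAT loop: pick the item best matched to the current ability
    # estimate, administer it, then re-estimate ability.
    pool, responses, theta = list(item_bank), [], 0.0
    for _ in range(min(n_questions, len(pool))):
        item = min(pool, key=lambda it: abs(it["b"] - theta))
        pool.remove(item)
        responses.append((item, ask_llm(item)))
        theta = estimate_theta(responses)
    return theta

# Toy run: a simulated examinee of true ability 1.0 that sometimes
# slips (d = 0.9) and guesses (c = 0.2), standing in for a real LLM.
bank = [{"a": 1.0, "b": (i - 50) / 10.0, "c": 0.2, "d": 0.9} for i in range(100)]
simulated_llm = lambda item: random.random() < p_correct(1.0, item)
print(adaptive_test(simulated_llm, bank))  # should land near 1.0
```

Because each new item is chosen near the current ability estimate, the estimate converges with far fewer questions than a fixed test set of the same size.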
Related papers
- PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Proficiency in 8th Grade Mathematics [3.9362370389588834]
We introduce a novel framework for Psychometrics-AssisTed benCHmarking of LLMs.
We implement the benchmark by measuring GPT-4's and Gemini-Pro-Vision's proficiency in 8th-grade mathematics against 56 human populations.
We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on existing practices.
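As a rough, hypothetical sketch of what "measuring proficiency against human populations" can look like, the snippet below places a model's IRT ability estimate within a human reference distribution via a percentile rank. The cohort here is simulated, whereas PATCH calibrates against real population norms.

```python
import random

def percentile_rank(model_theta, human_thetas):
    # Share of the human reference population scoring at or below the model.
    below = sum(1 for t in human_thetas if t <= model_theta)
    return 100.0 * below / len(human_thetas)

# Hypothetical cohort: ability estimates for one human population, assumed
# to be on the same IRT scale as the model's estimate.
cohort = [random.gauss(0.0, 1.0) for _ in range(1000)]
print(f"model sits at the {percentile_rank(0.7, cohort):.0f}th percentile")
```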
arXiv Detail & Related papers (2024-04-02T09:58:57Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
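No code accompanies this summary. As one common instantiation of token-level self-evaluation, the hypothetical sketch below asks the model a follow-up multiple-choice question ("Is the proposed answer correct? (A) Yes (B) No") and turns the next-token log-probabilities into a confidence score for selective generation. The prompt format, token labels, and abstention threshold are assumptions; the paper benchmarks several scoring variants rather than this one alone.

```python
import math

def self_eval_score(choice_logprobs, yes_token="A"):
    # Normalized probability mass on the "yes" option, taken from the
    # next-token log-probabilities of the self-evaluation prompt.
    probs = {tok: math.exp(lp) for tok, lp in choice_logprobs.items()}
    total = sum(probs.values())
    return probs.get(yes_token, 0.0) / total if total else 0.0

def selective_generate(answer, choice_logprobs, threshold=0.8):
    # Keep the answer only if the model's self-evaluation clears the bar;
    # None signals abstention.
    return answer if self_eval_score(choice_logprobs) >= threshold else None

# Hypothetical log-probabilities an LLM API might return for tokens "A"/"B".
print(selective_generate("Paris", {"A": -0.1, "B": -2.5}))  # -> "Paris"
print(selective_generate("Lyon", {"A": -1.2, "B": -0.4}))   # -> None
```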
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- ALBA: Adaptive Language-based Assessments for Mental Health [7.141164121152202]
This work introduces the task of Adaptive Language-Based Assessment (ALBA).
It involves adaptively ordering questions while also scoring an individual's latent psychological trait using limited language responses to previous questions.
Of the approaches tested, we found ALIRT to be the most accurate and scalable, reaching the highest accuracy with fewer questions.
arXiv Detail & Related papers (2023-11-11T03:37:17Z)
- Models of reference production: How do they withstand the test of time? [6.651864489482537]
We use the task of generating referring expressions in context as a case study and start our analysis from GREC.
We ask what the performance of models would be if we assessed them on more realistic datasets.
We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production.
arXiv Detail & Related papers (2023-07-27T12:46:38Z)
- Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
- Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation [4.722075132982135]
A major challenge is to determine when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable.
We have developed complexity measures, which quantify how difficult given observations are to assign to their true class label.
These measures are superior to the best-practice baseline in that, for a linear computation cost, they also quantify each observation's classification complexity in an explainable form.
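The digest does not name the measures. One classic per-observation measure in this spirit is k-disagreeing neighbors (kDN): the fraction of a point's k nearest neighbors carrying a different label. The brute-force sketch below is quadratic in the number of points and is only an illustration; the paper's own measures achieve the linear cost claimed above and may be defined differently.

```python
import numpy as np

def k_disagreeing_neighbors(X, y, k=5):
    # Per-observation complexity: fraction of each point's k nearest
    # neighbors (Euclidean) whose class label differs from the point's own.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest neighbors
    return (y[nn] != y[:, None]).mean(axis=1)

# Toy data: two clusters, with the last point planted inside the wrong one.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [5.5, 5.5]]
y = [0, 0, 0, 1, 1, 1, 0]
print(k_disagreeing_neighbors(X, y, k=3))  # last entry is 1.0: hardest to classify
```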
arXiv Detail & Related papers (2021-12-22T12:17:08Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
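As a loose, hypothetical illustration of label-free metric estimation, the sketch below treats a surrogate's posterior-predictive class probabilities as a stand-in for the unknown labels and scores the model under test against them. ALT-MAS itself trains a BNN and uses more careful estimators, so this shows only the core idea, not the framework.

```python
import numpy as np

def estimate_accuracy(surrogate_probs, mut_preds):
    # Label-free accuracy estimate: average the surrogate's posterior-
    # predictive probability of each prediction made by the model under test.
    idx = np.arange(len(mut_preds))
    return float(np.mean(surrogate_probs[idx, mut_preds]))

# Hypothetical posterior-predictive class probabilities (rows sum to 1) for
# four unlabeled test points, plus the model-under-test's hard predictions.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
preds = np.array([0, 1, 1, 1])
print(estimate_accuracy(probs, preds))  # 0.7
```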
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Quality meets Diversity: A Model-Agnostic Framework for Computerized Adaptive Testing [60.38182654847399]
Computerized Adaptive Testing (CAT) is emerging as a promising testing application in many scenarios.
We propose a novel framework, namely Model-Agnostic Adaptive Testing (MAAT), as a CAT solution.
arXiv Detail & Related papers (2021-01-15T06:48:50Z)