Running cognitive evaluations on large language models: The do's and the don'ts
- URL: http://arxiv.org/abs/2312.01276v1
- Date: Sun, 3 Dec 2023 04:28:19 GMT
- Title: Running cognitive evaluations on large language models: The do's and the don'ts
- Authors: Anna A. Ivanova
- Abstract summary: I describe methodological considerations for studies that aim to evaluate the cognitive capacities of large language models.
I list 10 do's and don'ts that should help design high-quality cognitive evaluations for AI systems.
- Score: 3.8073142980733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, I describe methodological considerations for studies that aim to evaluate the cognitive capacities of large language models (LLMs) using language-based behavioral assessments. Drawing on three case studies from the literature (a commonsense knowledge benchmark, a theory of mind evaluation, and a test of syntactic agreement), I describe common pitfalls that might arise when applying a cognitive test to an LLM. I then list 10 do's and don'ts that should help design high-quality cognitive evaluations for AI systems. I conclude by discussing four areas where the do's and don'ts are currently under active discussion: prompt sensitivity, cultural and linguistic diversity, using LLMs as research assistants, and running evaluations on open vs. closed LLMs. Overall, the goal of the paper is to contribute to the broader discussion of best practices in the rapidly growing field of AI Psychology.
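Prompt sensitivity, the first of the four open areas above, lends itself to a simple harness: present the same test item under several paraphrased prompts and measure how stable the model's answer is. Below is a minimal Python sketch; query_model is a hypothetical stand-in for whatever API the evaluated LLM exposes, and the paraphrases are illustrative, not drawn from the paper.

from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; swap in a real client."""
    return "yes"  # fixed toy answer so the sketch runs end to end

def answer_consistency(item: str, paraphrases: list[str]) -> float:
    """Fraction of prompt variants that yield the modal answer.

    1.0 means the answer is invariant to prompt wording; values near
    1 / len(set(answers)) indicate heavy prompt sensitivity.
    """
    answers = [query_model(p.format(item=item)) for p in paraphrases]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Three wordings of the same commonsense item (illustrative only).
paraphrases = [
    "Is the following statement true? {item} Answer yes or no.",
    "True or false: {item} Reply with yes or no.",
    "{item}\nDoes this hold in everyday life? Answer yes or no.",
]
print(answer_consistency("Ice floats on water.", paraphrases))

A reported score is then meaningful only alongside the consistency it was measured under; a single prompt's accuracy can over- or understate what such a harness reveals.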
Related papers
- Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior work on the logical reasoning ability of large language models for analyzing categorical syllogisms.
We first investigate all possible variations of categorical syllogisms from a purely logical perspective.
We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z) - MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
MultiPragEval is a robust test suite designed for the multilingual pragmatic evaluation of LLMs across English, German, Korean, and Chinese.
Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages.
arXiv Detail & Related papers (2024-06-11T21:46:03Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework, consisting of 6 pillars: Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations [28.097820924530655]
CPsyExam is designed to prioritize psychological knowledge and case analysis separately.
From the pool of 22k questions, we utilize 4k to create the benchmark.
arXiv Detail & Related papers (2024-05-16T16:02:18Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by
Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for
Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z) - Evaluating Subjective Cognitive Appraisals of Emotions from Large
Language Models [47.890846082224066]
This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to date that assesses 24 appraisal dimensions.
CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models to automatically assess and explain cognitive appraisals.
arXiv Detail & Related papers (2023-10-22T19:12:17Z) - Spoken Language Intelligence of Large Language Models for Language
Learning [3.5924382852350902]
We focus on evaluating the efficacy of large language models (LLMs) in the realm of education.
We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios.
We also investigate the influence of various prompting techniques, such as zero- and few-shot methods (a harness in this style is sketched after this list).
We find that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning about real-world problems.
arXiv Detail & Related papers (2023-08-28T12:47:41Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters
for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random; a chance-level check of this kind is sketched after this list.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.