Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
- URL: http://arxiv.org/abs/2407.10899v1
- Date: Mon, 15 Jul 2024 16:49:26 GMT
- Title: Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
- Authors: Yunting Liu, Shreya Bhandari, Zachary A. Pardos
- Abstract summary: We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) to produce responses with psychometric properties similar to human answers.
Results show that some LLMs have comparable or higher proficiency in College Algebra than college students.
- Score: 4.59804401179409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., items possessing the right psychometric properties). However, item calibration is time-consuming and costly, requiring a sufficient number of respondents to complete the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus), and various combinations of them via sampling methods, to produce responses with psychometric properties similar to human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated from LLM-Respondents correlate highly with their human-calibrated counterparts (e.g., > 0.8 for GPT-3.5) and closely resemble the parameters calibrated from the human subset (e.g., a 0.02 difference in Spearman correlation). Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, raising the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
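As a rough illustration of the workflow the abstract describes (calibrate item parameters from human and LLM response matrices, compare them by Spearman correlation, and augment a small human sample by resampling LLM respondents), here is a minimal sketch. It is not the authors' code: the data are synthetic Rasch-style responses, a logit-of-proportion-correct proxy stands in for a full IRT calibration, and all sample sizes are illustrative.

```python
# Minimal sketch, NOT the paper's pipeline: synthetic data, crude difficulty proxy.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 30
true_difficulty = rng.normal(0.0, 1.0, size=n_items)

def simulate_responses(n_respondents, ability_mean, ability_sd, difficulties, rng):
    """Simulate a binary (correct/incorrect) response matrix under a Rasch-style model."""
    abilities = rng.normal(ability_mean, ability_sd, size=n_respondents)
    p_correct = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
    return (rng.random(p_correct.shape) < p_correct).astype(int)

def logit_difficulty(responses):
    """Crude item-difficulty proxy: negative logit of each item's proportion correct."""
    p = responses.mean(axis=0).clip(0.01, 0.99)
    return -np.log(p / (1.0 - p))

human = simulate_responses(1000, 0.0, 1.0, true_difficulty, rng)   # full human pool (reference)
human_subset = human[:50]                                          # small, costly human sample
llm = simulate_responses(1000, 0.5, 0.4, true_difficulty, rng)     # narrower, higher-ability "LLM" pool

# Augmentation: pad the small human sample with resampled LLM-respondent rows.
augmented = np.vstack([human_subset, llm[rng.integers(0, len(llm), size=500)]])

reference = logit_difficulty(human)
for name, matrix in [("human subset", human_subset), ("LLM only", llm), ("augmented", augmented)]:
    rho, _ = spearmanr(logit_difficulty(matrix), reference)
    print(f"{name:12s}: Spearman rho vs. full human calibration = {rho:.2f}")
```

The final loop mirrors the paper's comparison in spirit: parameters calibrated from the small human subset, from LLM respondents alone, and from the augmented sample are each checked against parameters calibrated from the full human pool.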
Related papers
- This human study did not involve human subjects: Validating LLM simulations as behavioral evidence [15.56427716190418]
Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable. Statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses.
arXiv Detail & Related papers (2026-02-17T18:18:38Z) - Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks [8.246529401043128]
We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. We develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure (a simplified sketch of this procedure appears after this list). We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former.
arXiv Detail & Related papers (2025-10-08T05:17:33Z) - AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans [15.572185318032139]
Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLMs.
arXiv Detail & Related papers (2025-09-20T04:40:31Z) - Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems [10.227007419503297]
Large language models (LLMs) are increasingly revolutionizing evaluation methodologies across various human annotation tasks. We conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics.
arXiv Detail & Related papers (2025-07-23T07:51:56Z) - Estimating LLM Consistency: A User Baseline vs Surrogate Metrics [7.902385931726113]
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations. We propose a logit-based ensemble method for estimating LLM consistency.
arXiv Detail & Related papers (2025-05-26T16:53:47Z) - If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs).
Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z) - Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science [0.18416014644193066]
Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers.
We evaluate the performance of LLMs for systematic literature reviews.
arXiv Detail & Related papers (2025-03-16T05:52:18Z) - LLM-Mirror: A Generated-Persona Approach for Survey Pre-Testing [0.0]
We investigate whether providing respondents' prior information can replicate both statistical distributions and individual decision-making patterns.
We also introduce the concept of the LLM-Mirror, user personas generated by supplying respondent-specific information to the LLM.
Our findings show that: (1) PLS-SEM analysis shows LLM-generated responses align with human responses, (2) LLMs are capable of reproducing individual human responses, and (3) LLM-Mirror responses closely follow human responses at the individual level.
arXiv Detail & Related papers (2024-12-04T09:39:56Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Cognitive phantoms in LLMs through the lens of latent variables [0.3441021278275805]
Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour.
Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing behaviour.
This approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans.
This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires.
arXiv Detail & Related papers (2024-09-06T12:42:35Z) - Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models [41.324679754114165]
Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making.
We introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution.
We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment.
arXiv Detail & Related papers (2024-07-22T14:02:59Z) - Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring [21.7782670140939]
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments.
While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear.
This paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores.
arXiv Detail & Related papers (2024-07-04T22:26:20Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - Are Large Language Models Good Statisticians? [10.42853117200315]
StatQA is a new benchmark designed for statistical analysis tasks.
We show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%.
While open-source LLMs show limited capability, those fine-tuned ones exhibit marked improvements.
arXiv Detail & Related papers (2024-06-12T02:23:51Z) - On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Preference Ranking Optimization for Human Alignment [90.6952059194946]
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values.
Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment.
We propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to fine-tune LLMs for human alignment.
arXiv Detail & Related papers (2023-06-30T09:07:37Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
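The statistical procedure named in the "Can We Hide Machines in the Crowd?" entry above (Krippendorff's alpha, paired bootstrapping, and a TOST-style equivalence decision) can be sketched as follows. This is a simplified stand-in, not that paper's implementation: the labels are synthetic binary annotations, the alpha computation assumes a complete set of nominal ratings, and the ±0.05 equivalence margin is an assumed value.

```python
# Simplified sketch, NOT the cited paper's implementation: synthetic annotations,
# nominal-data Krippendorff's alpha, paired bootstrap, CI-based TOST-style decision.
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for a complete units x raters matrix of nominal labels."""
    ratings = np.asarray(ratings)
    n_units, n_raters = ratings.shape
    values, coded = np.unique(ratings, return_inverse=True)
    coded = coded.reshape(n_units, n_raters)
    coincidence = np.zeros((len(values), len(values)))
    for unit in coded:                      # build the coincidence matrix unit by unit
        for i in range(n_raters):
            for j in range(n_raters):
                if i != j:
                    coincidence[unit[i], unit[j]] += 1.0 / (n_raters - 1)
    n_c = coincidence.sum(axis=1)
    n = n_c.sum()
    observed_disagreement = coincidence.sum() - np.trace(coincidence)
    return 1.0 - (n - 1.0) * observed_disagreement / (n * n - (n_c ** 2).sum())

rng = np.random.default_rng(0)
n_units = 200
truth = (rng.random(n_units) < 0.6).astype(int)    # latent reference binary labels

def noisy_rater(labels, flip_rate, rng):
    """An annotator who flips the reference label at a fixed rate."""
    flips = rng.random(len(labels)) < flip_rate
    return np.where(flips, 1 - labels, labels)

humans = np.column_stack([noisy_rater(truth, 0.10, rng) for _ in range(3)])
llm = noisy_rater(truth, 0.12, rng)
with_llm = np.column_stack([humans[:, :2], llm])   # LLM replaces the third human annotator

# Paired bootstrap over units: the same resampled units feed both alpha estimates,
# then the 90% CI of the difference is checked against an assumed +/- 0.05 margin.
margin, diffs = 0.05, []
for _ in range(1000):
    idx = rng.integers(0, n_units, size=n_units)
    diffs.append(krippendorff_alpha_nominal(humans[idx])
                 - krippendorff_alpha_nominal(with_llm[idx]))
low, high = np.percentile(diffs, [5, 95])
print(f"alpha(humans)        = {krippendorff_alpha_nominal(humans):.3f}")
print(f"alpha(with LLM)      = {krippendorff_alpha_nominal(with_llm):.3f}")
print(f"90% CI of difference = ({low:.3f}, {high:.3f}); "
      f"equivalent within +/-{margin}: {(-margin < low) and (high < margin)}")
```

Requiring the 90% confidence interval of the difference to fall inside the equivalence bounds mirrors the standard confidence-interval formulation of TOST at a 0.05 significance level.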
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.