Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
- URL: http://arxiv.org/abs/2407.15645v1
- Date: Mon, 22 Jul 2024 14:02:59 GMT
- Title: Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
- Authors: Joy He-Yueya, Wanjing Anya Ma, Kanishk Gandhi, Benjamin W. Domingue, Emma Brunskill, Noah D. Goodman
- Abstract summary: Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making.
We introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution.
We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment.
- Score: 41.324679754114165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
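To make the assessment procedure in the abstract concrete, here is a minimal sketch, assuming simulated data and a simplified setup: it fits a basic Rasch (1PL) IRT model by joint maximum likelihood, using only NumPy, to binary response matrices from a simulated human population and a hypothetical LM "population", then correlates the two sets of estimated item difficulties as a rough alignment proxy. The data, the fitting routine, and the correlation-based proxy are illustrative assumptions, not the paper's implementation, which analyzes differential item functioning between the groups.

```python
# Illustrative sketch only: a joint maximum-likelihood Rasch (1PL) fit in NumPy,
# applied to two simulated response matrices ("humans" vs. a hypothetical LM
# population). The paper's actual metric analyzes differential item functioning;
# here a correlation of estimated item difficulties serves as a rough proxy.
import numpy as np

def fit_rasch(responses, n_iter=1000, lr=1.0):
    """Fit a Rasch model P(correct) = sigmoid(theta_person - b_item).

    responses: (n_persons, n_items) binary matrix of correct (1) / incorrect (0).
    Returns (person abilities, item difficulties).
    """
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)  # person abilities
    b = np.zeros(n_items)        # item difficulties
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p
        theta += lr * resid.mean(axis=1)   # gradient ascent on the log-likelihood
        b -= lr * resid.mean(axis=0)
        theta -= theta.mean()              # anchor the scale (identification)
    return theta, b

rng = np.random.default_rng(0)
n_items = 40
true_b = rng.normal(size=n_items)  # "true" human item difficulties

def simulate(n_persons, ability_mean, ability_sd, difficulties):
    theta = rng.normal(ability_mean, ability_sd, size=n_persons)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulties[None, :])))
    return (rng.random(p.shape) < p).astype(float)

human_responses = simulate(300, 0.0, 1.0, true_b)
# Hypothetical LM "population": more accurate overall, with item difficulties
# only partially correlated with the human ones.
lm_difficulties = 0.6 * true_b + 0.8 * rng.normal(size=n_items)
lm_responses = simulate(300, 1.0, 0.5, lm_difficulties)

_, b_human = fit_rasch(human_responses)
_, b_lm = fit_rasch(lm_responses)

print(f"Item-difficulty correlation (alignment proxy): {np.corrcoef(b_human, b_lm)[0, 1]:.2f}")
```

A correlation near 1 would indicate that items hard for humans are also hard for the simulated LM population; the paper's analysis goes further by testing for differences in item functioning between the groups rather than relying on a single correlation.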
Related papers
- Benchmarking Distributional Alignment of Large Language Models [43.0198231524816]
Language models (LMs) are increasingly used as simulacra for people, yet their ability to match the distribution of views of a specific demographic group remains uncertain.
We construct a dataset expanding beyond political values, create human baselines for this task, and evaluate the extent to which an LM can align with a particular group's opinion distribution.
Our analysis reveals open problems regarding whether, and how, LMs can be used to simulate humans, and shows that LLMs can describe an opinion distribution more accurately than they can simulate it.
arXiv Detail & Related papers (2024-11-08T08:41:17Z)
- Uncovering Factor Level Preferences to Improve Human-Model Alignment [58.50191593880829]
We introduce PROFILE, a framework that uncovers and quantifies the influence of specific factors driving preferences.
PROFILE's factor-level analysis explains the 'why' behind human-model alignment and misalignment.
We demonstrate how leveraging factor level insights, including addressing misaligned factors, can improve alignment with human preferences.
arXiv Detail & Related papers (2024-10-09T15:02:34Z)
- Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
- HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants [16.932009464531742]
We propose a novel automatic evaluation task: HumanRankEval.
It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans.
We show that HumanRankEval (HRE) correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
arXiv Detail & Related papers (2024-05-15T08:47:26Z)
- Explaining Large Language Models Decisions Using Shapley Values [1.223779595809275]
Large language models (LLMs) have opened up exciting possibilities for simulating human behavior and cognitive processes.
However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain.
This paper presents a novel approach based on Shapley values to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output.
arXiv Detail & Related papers (2024-03-29T22:49:43Z)
- Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization [66.4659448305396]
This study analyzes various LMs with three probing-based experiments to shed light on the reasons behind the In- vs. Cross-Topic generalization gap.
We demonstrate, for the first time, that generalization gaps and the robustness of the embedding space vary significantly across LMs.
arXiv Detail & Related papers (2024-02-02T12:59:27Z)
- On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The rise of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
- Predicting Human Psychometric Properties Using Computational Language Models [5.806723407090421]
Transformer-based language models (LMs) continue to achieve state-of-the-art performance on natural language processing (NLP) benchmarks.
Can LMs be of use in predicting the psychometric properties of test items, when those items are given to human participants?
We gather responses from numerous human participants and LMs on a broad diagnostic test of linguistic competencies.
We then calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately.
arXiv Detail & Related papers (2022-05-12T16:40:12Z)
- Can Transformer Language Models Predict Psychometric Properties? [0.0]
Transformer-based language models (LMs) continue to advance state-of-the-art performance on NLP benchmark tasks.
Can LMs be of use in predicting what the psychometric properties of test items will be when those items are given to human participants?
We gather responses from numerous human participants and LMs on a broad diagnostic test of linguistic competencies.
arXiv Detail & Related papers (2021-06-12T20:05:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.