Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings
- URL: http://arxiv.org/abs/2509.14405v1
- Date: Wed, 17 Sep 2025 20:11:23 GMT
- Title: Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings
- Authors: Javier Conde, María Grandury, Tairan Fu, Carlos Arriaga, Gonzalo Martínez, Thomas Clark, Sean Trott, Clarence Gerald Green, Pedro Reviriego, Marc Brysbaert
- Abstract summary: We present a comprehensive methodology for estimating word characteristics with Large Language Models (LLMs). A major emphasis in the guide is the validation of LLM-generated data with human "gold standard" norms. We also present a software framework that implements our methodology and supports both commercial and open-weight models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word-level psycholinguistic norms lend empirical support to theories of language processing. However, obtaining such human-based measures is not always feasible or straightforward. One promising approach is to augment human norming datasets by using Large Language Models (LLMs) to predict these characteristics directly, a practice that is rapidly gaining popularity in psycholinguistics and cognitive science. However, the novelty of this approach (and the relative inscrutability of LLMs) necessitates the adoption of rigorous methodologies that guide researchers through this process, present the range of possible approaches, and clarify limitations that are not immediately apparent, but may, in some cases, render the use of LLMs impractical. In this work, we present a comprehensive methodology for estimating word characteristics with LLMs, enriched with practical advice and lessons learned from our own experience. Our approach covers both the direct use of base LLMs and the fine-tuning of models, an alternative that can yield substantial performance gains in certain scenarios. A major emphasis in the guide is the validation of LLM-generated data with human "gold standard" norms. We also present a software framework that implements our methodology and supports both commercial and open-weight models. We illustrate the proposed approach with a case study on estimating word familiarity in English. Using base models, we achieved a Spearman correlation of 0.8 with human ratings, which increased to 0.9 when employing fine-tuned models. This methodology, framework, and set of best practices aim to serve as a reference for future research on leveraging LLMs for psycholinguistic and lexical studies.
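The validation step the abstract describes (comparing LLM-estimated ratings against human "gold standard" norms via Spearman rank correlation) can be sketched in a few lines of pure Python. All words and rating values below are illustrative placeholders, not data from the paper.

```python
# Minimal sketch of validating LLM-estimated word characteristics against
# human "gold standard" norms using Spearman rank correlation.

def spearman(xs, ys):
    """Spearman rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical mean familiarity ratings on a 1-7 scale (word -> rating).
human_norms = {"dog": 6.9, "apple": 6.7, "chair": 6.8, "quark": 2.1, "azimuth": 1.8}
llm_estimates = {"dog": 6.8, "apple": 6.5, "chair": 6.6, "quark": 2.2, "azimuth": 2.8}

words = sorted(human_norms)  # align the two rating vectors by word
rho = spearman([human_norms[w] for w in words], [llm_estimates[w] for w in words])
print(f"Spearman rho = {rho:.2f}")  # -> 0.90 for these placeholder values
```

In practice, ties in the ratings would require the average-rank correction (e.g. `scipy.stats.spearmanr`), but the tie-free formula above suffices to show how a correlation of 0.8 or 0.9 with human norms, as reported in the case study, would be computed.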
Related papers
- Nonparametric LLM Evaluation from Preference Data [86.96268870461472]
We propose a nonparametric statistical framework, DMLEval, for comparing and ranking large language models (LLMs) from preference data. Our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.
arXiv Detail & Related papers (2026-01-29T15:00:07Z)
- A word association network methodology for evaluating implicit biases in LLMs compared to humans [0.0]
We present a novel word association network methodology for evaluating implicit biases in large language models (LLMs). Our method taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party.
arXiv Detail & Related papers (2025-10-28T15:03:18Z)
- Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [54.38054999271322]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the normative Bayesian model. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
- Investigating Privacy Bias in Training Data of Language Models [1.3167450470598043]
A privacy bias refers to the skew in the appropriateness of information flows within a given context. This skew may either align with existing expectations or signal a symptom of systemic issues. We present a novel approach to assessing privacy biases using a contextual integrity-based methodology.
arXiv Detail & Related papers (2024-09-05T17:50:31Z)
- Can DPO Learn Diverse Human Values? A Theoretical Scaling Law [7.374590753074647]
Preference learning trains models to distinguish between preferred and non-preferred responses based on human feedback. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity. Our framework rigorously assesses how well models generalize after a finite number of gradient steps.
arXiv Detail & Related papers (2024-08-06T22:11:00Z) - A Survey on Human Preference Learning for Large Language Models [81.41868485811625]
The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning.
This survey covers the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs.
arXiv Detail & Related papers (2024-06-17T03:52:51Z)
- CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z)
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
- Pedagogical Alignment of Large Language Models [24.427653091950994]
Large Language Models (LLMs) provide immediate answers rather than guiding students through the problem-solving process.
This paper investigates Learning from Human Preferences (LHP) algorithms to achieve this alignment objective.
arXiv Detail & Related papers (2024-02-07T16:15:59Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations, such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of technologies for aligning LLMs with human intentions, covering the following aspects.
arXiv Detail & Related papers (2023-07-24T17:44:58Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)