Large Language Models for Psycholinguistic Plausibility Pretesting
- URL: http://arxiv.org/abs/2402.05455v1
- Date: Thu, 8 Feb 2024 07:20:02 GMT
- Title: Large Language Models for Psycholinguistic Plausibility Pretesting
- Authors: Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
- Abstract summary: We investigate whether Language Models (LMs) can be used to generate plausibility judgements.
We find that GPT-4 plausibility judgements highly correlate with human judgements across the structures we examine.
We then test whether this correlation implies that LMs can be used instead of humans for pretesting.
- Score: 47.1250032409564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In psycholinguistics, the creation of controlled materials is crucial to
ensure that research outcomes are solely attributed to the intended
manipulations and not influenced by extraneous factors. To achieve this,
psycholinguists typically pretest linguistic materials, where a common pretest
is to solicit plausibility judgments from human evaluators on specific
sentences. In this work, we investigate whether Language Models (LMs) can be
used to generate these plausibility judgements. We investigate a wide range of
LMs across multiple linguistic structures and evaluate whether their
plausibility judgements correlate with human judgements. We find that GPT-4
plausibility judgements highly correlate with human judgements across the
structures we examine, whereas other LMs correlate well with humans on commonly
used syntactic structures. We then test whether this correlation implies that
LMs can be used instead of humans for pretesting. We find that when
coarse-grained plausibility judgements are needed, this works well, but when
fine-grained judgements are necessary, even GPT-4 does not provide satisfactory
discriminative power.
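As a rough illustration of how such an LM-based pretest can be scored, the sketch below correlates LM plausibility ratings with mean human ratings per item. The 1-7 scale, the prompt wording in the comments, and the toy ratings are illustrative assumptions, not the authors' exact setup.
```python
# Minimal sketch (not the paper's exact pipeline): score an LM-based plausibility
# pretest by correlating LM ratings with mean human ratings per sentence.
# Assumed elicitation prompt (illustrative): "On a scale from 1 (completely
# implausible) to 7 (completely plausible), how plausible is this sentence?
# Answer with a single number."
from scipy.stats import spearmanr

def agreement(lm_ratings: list[float], human_ratings: list[float]) -> float:
    """Spearman correlation between LM ratings and mean human plausibility ratings."""
    rho, _ = spearmanr(lm_ratings, human_ratings)
    return rho

# Toy example with hypothetical ratings for four items
# (two designed to be plausible, two implausible):
human_means = [6.4, 5.9, 2.1, 1.3]
lm_scores = [7.0, 6.0, 3.0, 1.0]
print(f"Spearman rho = {agreement(lm_scores, human_means):.2f}")
```
A high correlation of this kind supports coarse-grained screening of materials; the paper's point is that it does not by itself guarantee fine-grained discriminative power between closely matched items.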
Related papers
- HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs).
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Challenging the Validity of Personality Tests for Large Language Models [2.9123921488295768]
Large language models (LLMs) behave increasingly human-like in text-based interactions.
LLMs' responses to personality tests systematically deviate from human responses.
arXiv Detail & Related papers (2023-11-09T11:54:01Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4 [28.661237196238996]
We conduct an in-depth examination of a collection of pairwise human judgments released by OpenAI.
We find that the most favored factors vary across tasks and genres, whereas the least favored factors tend to be consistent.
Our findings have implications on the construction of balanced datasets in human preference evaluations.
arXiv Detail & Related papers (2023-05-24T04:13:15Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
- Are Representations Built from the Ground Up? An Empirical Examination of Local Composition in Language Models [91.3755431537592]
Representing compositional and non-compositional phrases is critical for language understanding.
We first formulate a problem of predicting the LM-internal representations of longer phrases given those of their constituents.
While we would expect the predictive accuracy to correlate with human judgments of semantic compositionality, we find this is largely not the case.
arXiv Detail & Related papers (2022-10-07T14:21:30Z)
- Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z)
- Predicting Human Psychometric Properties Using Computational Language Models [5.806723407090421]
Transformer-based language models (LMs) continue to achieve state-of-the-art performance on natural language processing (NLP) benchmarks.
Can LMs be of use in predicting the psychometric properties of test items, when those items are given to human participants?
We gather responses from numerous human participants and LMs on a broad diagnostic test of linguistic competencies.
We then calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately.
arXiv Detail & Related papers (2022-05-12T16:40:12Z)
- Do language models learn typicality judgments from text? [6.252236971703546]
We evaluate predictive language models (LMs) on a prevalent phenomenon in cognitive science: typicality.
Our first test targets whether typicality modulates LMs in assigning taxonomic category memberships to items.
The second test investigates sensitivities to typicality in LMs' probabilities when extending new information about items to their categories.
arXiv Detail & Related papers (2021-05-06T21:56:40Z)