Related papers: HLB: Benchmarking LLMs' Humanlikeness in Language Use

HLB: Benchmarking LLMs' Humanlikeness in Language Use

URL: http://arxiv.org/abs/2409.15890v1
Date: Tue, 24 Sep 2024 09:02:28 GMT
Title: HLB: Benchmarking LLMs' Humanlikeness in Language Use
Authors: Xufeng Duan, Bei Xiao, Xuemei Tang, Zhenguang G. Cai,
Abstract summary: We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
Score: 2.438748974410787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.

Related papers

Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility [7.183662547358301]
We examine whether large language models process language similarly to humans. We find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation.
arXiv Detail & Related papers (2025-03-21T23:25:42Z)
Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning [49.60849499134362]
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs' true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers.
arXiv Detail & Related papers (2024-11-12T04:16:44Z)
LMLPA: Language Model Linguistic Personality Assessment [11.599282127259736]
Large Language Models (LLMs) are increasingly used in everyday life and research. measuring the personality of a given LLM is currently a challenge. This paper introduces the Language Model Linguistic Personality Assessment (LMLPA), a system designed to evaluate the linguistic personalities of LLMs.
arXiv Detail & Related papers (2024-10-23T07:48:51Z)
Cross-lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models [16.0617753653454]
This study presents a comparative analysis between human performance and SSL models. We also compare the SER ability of models and humans at both utterance- and segment-levels. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers.
arXiv Detail & Related papers (2024-09-25T13:27:17Z)
DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity [5.388338680646657]
We show that GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features. We propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions. Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements.
arXiv Detail & Related papers (2024-08-30T21:33:58Z)
Virtual Personas for Language Models via an Anthology of Backstories [5.2112564466740245]
"Anthology" is a method for conditioning large language models to particular virtual personas by harnessing open-ended life narratives. We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations.
arXiv Detail & Related papers (2024-07-09T06:11:18Z)
How Does Quantization Affect Multilingual LLMs? [50.867324914368524]
Quantization techniques are widely used to improve inference speed and deployment of large language models. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales.
arXiv Detail & Related papers (2024-07-03T15:39:40Z)
High-Dimension Human Value Representation in Large Language Models [60.33033114185092]
We propose UniVaR, a high-dimensional representation of human value distributions in Large Language Models (LLMs) We show that UniVaR is a powerful tool to compare the distribution of human values embedded in different LLMs with different langauge sources.
arXiv Detail & Related papers (2024-04-11T16:39:00Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Divergences between Language Models and Human Brains [59.100552839650774]
We systematically explore the divergences between human and machine language processing. We identify two domains that LMs do not capture well: social/emotional intelligence and physical commonsense. Our results show that fine-tuning LMs on these domains can improve their alignment with human brain responses.
arXiv Detail & Related papers (2023-11-15T19:02:40Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation [0.0]
We use a crowd-authored dialogue corpus to fine-tune six different language generation models. Two of these models incorporate multi-task learning and use subjective ratings of lines as part of an explicit learning goal. A human evaluation of the generated dialogue lines reveals that utterances generated by the multi-tasking models were subjectively rated as the most typical, most moving the conversation forward, and least offensive.
arXiv Detail & Related papers (2021-04-12T06:33:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.