Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in
Large Language Models
- URL: http://arxiv.org/abs/2305.10782v3
- Date: Wed, 8 Nov 2023 12:39:51 GMT
- Title: Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in
Large Language Models
- Authors: Raj Sanjay Shah, Vijay Marupudi, Reba Koenen, Khushi Bhardwaj, Sashank
Varma
- Abstract summary: Large Language Models (LLMs) do not differentially represent numbers, which are pervasive in text.
In this work, we investigate how well popular LLMs capture the magnitudes of numbers from a behavioral lens.
- Score: 4.412336603162406
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) do not differentially represent numbers, which
are pervasive in text. In contrast, neuroscience research has identified
distinct neural representations for numbers and words. In this work, we
investigate how well popular LLMs capture the magnitudes of numbers (e.g., that
$4 < 5$) from a behavioral lens. Prior research on the representational
capabilities of LLMs evaluates whether they show human-level performance, for
instance, high overall accuracy on standard benchmarks. Here, we ask a
different question, one inspired by cognitive science: How closely do the
number representations of LLMscorrespond to those of human language users, who
typically demonstrate the distance, size, and ratio effects? We depend on a
linking hypothesis to map the similarities among the model embeddings of number
words and digits to human response times. The results reveal surprisingly
human-like representations across language models of different architectures,
despite the absence of the neural circuitry that directly supports these
representations in the human brain. This research shows the utility of
understanding LLMs using behavioral benchmarks and points the way to future
work on the number representations of LLMs and their cognitive plausibility.
Related papers
- Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models [36.983534612895156]
In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks.
This paper focuses on five different NLI benchmarks across six models of different scales.
We investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training.
arXiv Detail & Related papers (2024-11-21T13:09:36Z) - HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs)
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z) - What Are Large Language Models Mapping to in the Brain? A Case Against Over-Reliance on Brain Scores [1.8175282137722093]
Internal representations from large language models (LLMs) achieve state-of-the-art brain scores, leading to speculation that they share computational principles with human language processing.
Here, we analyze three neural datasets used in an impactful study on LLM-to-brain mappings, with a particular focus on an fMRI dataset where participants read short passages.
We find that brain scores of trained LLMs on this dataset can largely be explained by sentence length, position, and pronoun-dereferenced static word embeddings.
arXiv Detail & Related papers (2024-06-03T17:13:27Z) - Will the Real Linda Please Stand up...to Large Language Models? Examining the Representativeness Heuristic in LLMs [7.100094213474042]
Large language models (LLMs) have demonstrated remarkable proficiency in modeling text and generating human-like text.
LLMs may be susceptible to a common cognitive trap in human decision-making called the representativeness.
This research investigates the impact of the representativeness on LLM reasoning.
arXiv Detail & Related papers (2024-04-01T20:15:06Z) - Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Divergences between Language Models and Human Brains [63.405788999891335]
Recent research has hinted that brain signals can be effectively predicted using internal representations of language models (LMs)
We show that there are clear differences in how LMs and humans represent and use language.
We identify two domains that are not captured well by LMs: social/emotional intelligence and physical commonsense.
arXiv Detail & Related papers (2023-11-15T19:02:40Z) - Do LLMs exhibit human-like response biases? A case study in survey
design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Roles of Scaling and Instruction Tuning in Language Perception: Model
vs. Human Attention [58.817405319722596]
This work compares the self-attention of several large language models (LLMs) in different sizes to assess the effect of scaling and instruction tuning on language perception.
Results show that scaling enhances the human resemblance and improves the effective attention by reducing the trivial pattern reliance, while instruction tuning does not.
We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting a sub-optimal language perception of all models.
arXiv Detail & Related papers (2023-10-29T17:16:40Z) - Probing Large Language Models from A Human Behavioral Perspective [24.109080140701188]
Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP.
The understanding of their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA) remains largely unexplored.
arXiv Detail & Related papers (2023-10-08T16:16:21Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.