Related papers: Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

URL: http://arxiv.org/abs/2404.14883v2
Date: Mon, 07 Oct 2024 11:37:44 GMT
Title: Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans
Authors: Vittoria Dentella, Fritz Guenther, Evelina Leivada
Abstract summary: This work investigates the role of model scaling in determining whether differences between humans and models are amenable to model size. We test three Large Language Models (LLMs) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences.
Score: 1.8434042562191815
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans, however it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.

Related papers

Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity [0.0]
Human participants and four instruction-tuned LLMs rated 240 dialogue-based sentences representing six linguistic traits. Results indicated that humans and LLMs aligned at the surface level but diverged significantly at the representational level. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions.
arXiv Detail & Related papers (2026-01-14T00:13:00Z)
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task [0.0]
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks. This study examines whether LLMs can approximate individual differences in the phonemic fluency task.
arXiv Detail & Related papers (2025-05-22T03:08:27Z)
A suite of LMs comprehend puzzle statements as well as humans [13.386647125288516]
We report a preregistered study comparing human responses in two conditions: one that allowed rereading and one that restricted it. Human accuracy dropped significantly when rereading was restricted, falling below that of Falcon-180B-Chat and GPT-4. Results suggest shared pragmatic sensitivities rather than model-specific deficits.
arXiv Detail & Related papers (2025-05-13T22:18:51Z)
Making LLMs Reason? The Intermediate Language Problem in Neurosymbolic Approaches [49.567092222782435]
We introduce the intermediate language problem, which is the problem of choosing a suitable formal language representation for neurosymbolic approaches. We show a maximum difference of 53.20% in overall accuracy and 49.26% in execution accuracy. When using the GPT4o-mini LLM we beat the state-of-the-art in overall accuracy on the ProntoQA dataset by 21.20% and by 50.50% on the ProofWriter dataset.
arXiv Detail & Related papers (2025-02-24T14:49:52Z)
SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z)
A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans [3.3311266423308252]
We introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations.
arXiv Detail & Related papers (2024-12-02T05:11:34Z)
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models [36.983534612895156]
In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks. This paper focuses on five different NLI benchmarks across six models of different scales. We investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training.
arXiv Detail & Related papers (2024-11-21T13:09:36Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to a dialect in canonical reasoning tasks. We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K. We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Perceptions of Linguistic Uncertainty by Language Models and Humans [26.69714008538173]
We investigate how language models map linguistic expressions of uncertainty to numerical responses. We find that 7 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge.
arXiv Detail & Related papers (2024-07-22T17:26:12Z)
Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction. We find that contextual characteristics significantly affect human reliance behavior. Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
Grammaticality Representation in ChatGPT as Compared to Linguists and Laypeople [0.0]
This study builds upon a previous study that collected laypeople's grammatical judgments on 148 linguistic phenomena. Our primary focus was to compare ChatGPT with both laypeople and linguists in the judgement of these linguistic constructions. Overall, our findings demonstrate convergence rates ranging from 73% to 95% between ChatGPT and linguists, with an overall point-estimate of 89%.
arXiv Detail & Related papers (2024-06-17T00:23:16Z)
Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention [58.817405319722596]
This work compares the self-attention of several large language models (LLMs) in different sizes to assess the effect of scaling and instruction tuning on language perception. Results show that scaling enhances the human resemblance and improves the effective attention by reducing the trivial pattern reliance, while instruction tuning does not. We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting a sub-optimal language perception of all models.
arXiv Detail & Related papers (2023-10-29T17:16:40Z)
Do Large Language Models Show Decision Heuristics Similar to Humans? A Case Study Using GPT-3.5 [0.0]
GPT-3.5 is an example of an LLM that supports a conversational agent called ChatGPT. In this work, we used a series of novel prompts to determine whether ChatGPT shows biases and other decision effects. We also tested the same prompts on human participants.
arXiv Detail & Related papers (2023-05-08T01:02:52Z)
Massively Multilingual Shallow Fusion with Large Language Models [62.76735265311028]
We train a single multilingual language model (LM) for shallow fusion in multiple languages. Compared to a dense LM of similar computation during inference, GLaM reduces the WER of an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages with an average relative WER reduction of 3.85%, and a maximum reduction of 10%.
arXiv Detail & Related papers (2023-02-17T14:46:38Z)
A fine-grained comparison of pragmatic language understanding in humans and language models [2.231167375820083]
We compare language models and humans on seven pragmatic phenomena. We find that the largest models achieve high accuracy and match human error patterns. We also find preliminary evidence that models and humans are sensitive to similar linguistic cues.
arXiv Detail & Related papers (2022-12-13T18:34:59Z)
Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data. This may cause the models to grasp cultural values including moral judgments from the high-resource languages. The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages. Our largest model sets new state of the art in few-shot learning in more than 20 representative languages. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.