Can language models handle recursively nested grammatical structures? A case study on comparing models and humans
- URL: http://arxiv.org/abs/2210.15303v1
- Date: Thu, 27 Oct 2022 10:25:12 GMT
- Title: Can language models handle recursively nested grammatical structures? A case study on comparing models and humans
- Authors: Andrew Kyle Lampinen
- Abstract summary: How should we compare the capabilities of language models and humans?
I consider a case study: processing of nested grammatical structures.
I suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How should we compare the capabilities of language models and humans? Here, I
consider a case study: processing of recursively nested grammatical structures.
Prior work has suggested that language models cannot handle these structures as
reliably as humans can. However, the humans were provided with instructions and
training before being evaluated, while the language models were evaluated
zero-shot. I therefore attempt to more closely match the evaluation paradigms
by providing language models with few-shot prompts. A simple prompt, which
contains substantially less content than the human training, allows large
language models to consistently outperform the human results. The same prompt
even allows extrapolation to more-deeply-nested conditions than have been
tested in humans. Further, a reanalysis of the prior human experiments suggests
that the humans may not perform above chance at the difficult structures
initially. These results suggest that large language models can in fact process
recursively nested grammatical structures comparably to humans. This case study
highlights how discrepancies in the quantity of experiment-specific context can
confound comparisons of language models and humans. I use this case study to
reflect on the broader challenge of comparing human and model capabilities, and
to suggest that there is an important difference between evaluating cognitive
models of a specific phenomenon and evaluating broadly-trained models.
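To make the few-shot evaluation paradigm concrete, here is a minimal sketch, assuming a Hugging Face causal LM scored on a forced choice between a grammatical and an ungrammatical judgment. The prompt, the stimuli, and the model choice are all illustrative assumptions, not the paper's actual materials (the paper evaluated much larger models).

```python
# Minimal sketch of few-shot grammaticality evaluation on center-embedded
# sentences. The prompt and stimuli below are invented for illustration;
# the paper's own prompts, items, and (much larger) models differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical few-shot prompt: a couple of worked grammaticality judgments.
FEW_SHOT = (
    "Is the sentence grammatical? Answer yes or no.\n"
    "Sentence: The dog the cat chased barked. Answer: yes\n"
    "Sentence: The dog the cat chased barks were. Answer: no\n"
)

def total_logprob(text: str) -> float:
    """Summed log-probability of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

# A more deeply nested test item (also invented for illustration).
test = "Sentence: The rat the cat the dog bit chased squeaked. Answer:"
score_yes = total_logprob(FEW_SHOT + test + " yes")
score_no = total_logprob(FEW_SHOT + test + " no")
print("model judges:", "grammatical" if score_yes > score_no else "ungrammatical")
```

Comparing the scores of the two continuations stands in for the model's forced-choice judgment on each test item.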
Related papers
- DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark that evaluates vision-language models on language tasks paired with human behavioral data.
We show that DevBench enables direct comparison of model performance to human language development.
These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
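As a rough illustration of the Chain-of-Hindsight recipe, feedback of either polarity can be serialized into plain training text. The template wording and field names here are assumptions for illustration, not the paper's exact format.

```python
# Sketch: serialize preference feedback of either polarity into plain text
# for ordinary next-token fine-tuning. Template wording is illustrative only.
def chain_of_hindsight_example(prompt: str, preferred: str, rejected: str) -> str:
    # Both answers appear in one sequence, each labeled with hindsight
    # feedback, so the model learns from negative as well as positive outputs.
    return (
        f"{prompt}\n"
        f"A bad answer is: {rejected}\n"
        f"A good answer is: {preferred}\n"
    )

pairs = [
    ("Summarize: The meeting was moved from Monday to Friday.",
     "The meeting was rescheduled to Friday.",   # preferred output
     "Meetings are usually held on Mondays."),   # rejected output
]
training_texts = [chain_of_hindsight_example(p, good, bad) for p, good, bad in pairs]
# These strings would then feed a standard fine-tuning loop, typically with
# the loss restricted to the feedback-conditioned answer spans.
```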
- Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers [0.6091702876917281]
We focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models.
We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes.
arXiv Detail & Related papers (2022-12-16T20:01:22Z)
- Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
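For reference, locally typical sampling is exposed in the Hugging Face transformers generation API via the `typical_p` argument. This is a minimal usage sketch, not the paper's experimental setup; the model and hyperparameter values are illustrative.

```python
# Sketch: decoding with locally typical sampling via transformers' generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The study found that", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,        # sample instead of greedy/beam decoding
        typical_p=0.9,         # keep tokens whose surprisal is near the expected value
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```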
- A Targeted Assessment of Incremental Processing in Neural Language Models and Humans [2.7624021966289605]
We present a scaled-up comparison of incremental processing in humans and neural language models.
Data comes from a novel online experimental paradigm called the Interpolated Maze task.
We find that both humans and language models show increased processing difficulty in ungrammatical sentence regions.
arXiv Detail & Related papers (2021-06-06T20:04:39Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
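To illustrate the kind of systematic singular/plural manipulation such behavioral experiments rely on, here is a toy stimulus generator. The items are simple prepositional-phrase sentences of my own, far simpler than the paper's nested materials.

```python
# Toy generator for number-agreement items: cross every singular/plural
# combination of a head and an intervening noun, with matched grammatical
# and violation verb forms. Items are illustrative, not the paper's stimuli.
from itertools import product

VERB = {"sg": "is", "pl": "are"}

def stimuli():
    for (head, head_num), (mid, mid_num) in product(
        [("key", "sg"), ("keys", "pl")],
        [("cabinet", "sg"), ("cabinets", "pl")],
    ):
        # The verb must agree with the head noun, not the intervening one.
        grammatical = f"The {head} to the {mid} {VERB[head_num]} rusty."
        wrong = "pl" if head_num == "sg" else "sg"
        violation = f"The {head} to the {mid} {VERB[wrong]} rusty."
        yield head_num, mid_num, grammatical, violation

for head_num, mid_num, gram, viol in stimuli():
    print(f"[head={head_num}, intervening={mid_num}] {gram} | *{viol}")
```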
- Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment [17.995905582226463]
We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish.
English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences.
arXiv Detail & Related papers (2020-05-01T01:21:47Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
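As a toy instance of that hypothesis-testing framing, one can threshold a per-token log-likelihood statistic under a reference model. The model choice and threshold below are arbitrary assumptions of mine; the paper itself studies the fundamental limits of such tests rather than prescribing this recipe.

```python
# Toy detector: treat "generated" vs. "genuine" as competing hypotheses and
# threshold the mean per-token log-likelihood under a reference LM.
# Model choice and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_token_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item()

def looks_generated(text: str, threshold: float = -3.0) -> bool:
    # Machine-generated text tends to be unusually likely under the model.
    return mean_token_logprob(text) > threshold

print(looks_generated("The results of the study were consistent with prior work."))
```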
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.