Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm
- URL: http://arxiv.org/abs/2412.18120v2
- Date: Thu, 26 Dec 2024 16:31:53 GMT
- Title: Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm
- Authors: Xiaoyang Hu, Richard L. Lewis
- Abstract summary: A recent study argues that GPT 3.5's declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans (Gong et al., 2024).
By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance instead reflects a limitation in task comprehension and task set maintenance.
- Score: 9.577716124021029
- Abstract: Cognitive tasks originally developed for humans are now increasingly used to study language models. While applying these tasks is often straightforward, interpreting their results can be challenging. In particular, when a model underperforms, it is often unclear whether this results from a limitation in the cognitive ability being tested or a failure to understand the task itself. A recent study argues that GPT 3.5's declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans (Gong et al., 2024). By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance instead reflects a limitation in task comprehension and task set maintenance. In addition, we challenge the best-performing model with progressively harder versions of the task (up to 10-back) and experiment with alternative prompting strategies, before analyzing model attentions. Our larger aim is to contribute to the ongoing conversation around refining methodologies for the cognitive evaluation of language models.
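In the N-back paradigm studied here, the subject sees a stream of stimuli one at a time and must judge, for each item, whether it matches the item shown N positions earlier; raising N increases the working memory load. As a minimal sketch (our illustration, not the paper's materials; the alphabet, match rate, and labeling scheme are assumptions), such a task can be generated and scored as follows:

```python
import random

def make_nback_stream(n, length, match_rate=0.3, alphabet="bcdfghjklm"):
    """Generate a letter stream in which roughly `match_rate` of the
    positions after the first n repeat the letter from n steps back."""
    stream = [random.choice(alphabet) for _ in range(n)]
    for _ in range(length - n):
        if random.random() < match_rate:
            stream.append(stream[-n])  # force an n-back match
        else:
            stream.append(random.choice([c for c in alphabet if c != stream[-n]]))
    return stream

def gold_labels(stream, n):
    """'m' where the letter matches the one n positions back, '-' otherwise."""
    return ["-"] * n + ["m" if stream[i] == stream[i - n] else "-"
                        for i in range(n, len(stream))]

def score(responses, labels, n):
    """Accuracy over the positions where an n-back judgment is defined."""
    scored = list(zip(responses, labels))[n:]
    return sum(r == g for r, g in scored) / len(scored)

stream = make_nback_stream(n=2, length=20)
print(" ".join(stream))
print(" ".join(gold_labels(stream, n=2)))
```

The paper's diagnostic point fits this setup: a model that quietly drifts into responding 1-back, or loses the instruction altogether, will score poorly on 2-back and 3-back streams for reasons of task comprehension and task set maintenance rather than memory capacity.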
Related papers
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse [9.542503507653494]
Chain-of-thought (CoT) has become a widely used strategy for working with large language and multimodal models.
We identify characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology.
We find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance when using inference-time reasoning.
arXiv Detail & Related papers (2024-10-27T18:30:41Z)
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
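As a hedged illustration of what such an evaluation looks like in practice (the model and task below are placeholders, and the entry-point name reflects recent versions of the library):

```python
# Rough sketch of lm-eval's Python API (v0.4+); argument names and
# defaults may differ across versions -- consult the library's docs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```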
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Scalable Language Model with Generalized Continual Learning [58.700439919096155]
Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
- Auxiliary task demands mask the capabilities of smaller language models [2.938889003635811]
We show that evaluation methods with greater task demands yield lower performance than evaluations with reduced demands.
Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence.
arXiv Detail & Related papers (2024-04-03T02:56:52Z)
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks [71.19560970717495]
Recent language models show impressive performance across a wide range of tasks.
Are these skills general and transferable, or specialized to specific tasks seen during pretraining?
We propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks.
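A well-known example of this idea is arithmetic: the default task assumes base 10, and a counterfactual variant poses the same problem in an unusual base. A minimal sketch of generating such paired items (our illustration of the idea, not the paper's exact materials):

```python
def to_base(x, base):
    """Render a non-negative integer in the given base (bases up to 10)."""
    digits = []
    while True:
        digits.append(str(x % base))
        x //= base
        if x == 0:
            return "".join(reversed(digits))

def addition_item(a, b, base):
    """Build a question/answer pair for addition in the given base."""
    question = f"In base {base}, what is {to_base(a, base)} + {to_base(b, base)}?"
    return question, to_base(a + b, base)

print(addition_item(27, 68, base=10))  # default task
print(addition_item(27, 68, base=9))   # counterfactual variant
```

A model that succeeds on the default item but fails its counterfactual twin is plausibly reciting patterns from pretraining rather than executing a transferable procedure.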
arXiv Detail & Related papers (2023-07-05T17:50:42Z)
- Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models [5.975913042883176]
Recent work studies the cognitive capabilities of language models through psychological tests designed for humans.
We formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks.
arXiv Detail & Related papers (2022-12-21T04:43:19Z)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [648.3665819567409]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
BIG-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
arXiv Detail & Related papers (2022-06-09T17:05:34Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT) testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks.
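Masked prediction probes of this kind can be reproduced with standard tooling; a small sketch using the transformers fill-mask pipeline (the sentence is our illustrative example, not an item from the paper's test suite):

```python
# Minimal masked-prediction probe with HuggingFace transformers.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The lawyer who the judge [MASK] was late to court."
for pred in unmasker(sentence, top_k=5):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```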
arXiv Detail & Related papers (2020-11-02T13:25:39Z)