Related papers: Auxiliary task demands mask the capabilities of smaller language models

Auxiliary task demands mask the capabilities of smaller language models

URL: http://arxiv.org/abs/2404.02418v2
Date: Mon, 29 Jul 2024 20:29:32 GMT
Title: Auxiliary task demands mask the capabilities of smaller language models
Authors: Jennifer Hu, Michael C. Frank,
Abstract summary: We show that evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence.
Score: 2.938889003635811
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Developmental psychologists have argued about when cognitive capacities such as language understanding or theory of mind emerge. These debates often hinge on the concept of "task demands" -- the auxiliary challenges associated with performing a particular evaluation -- that may mask the child's underlying ability. The same issues arise when measuring the capacities of language models (LMs): performance on a task is a function of the model's underlying knowledge, combined with the model's ability to interpret and perform the task given its available resources. Here, we show that for analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. This "demand gap" is most pronounced for models with fewer parameters and less training data. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence (or lack thereof), but as a reflection of capacities seen through the lens of researchers' design choices.

Related papers

Pixels, Patterns, but No Poetry: To See The World like Humans [33.773551676022514]
State-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans.<n>This paper shifts focus from reasoning to perception.
arXiv Detail & Related papers (2025-07-21T21:50:16Z)
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks. We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks. We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z)
Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm [9.577716124021029]
We argue that GPT 3.5's declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans. By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance instead reflects a limitation in task comprehension and task set maintenance.
arXiv Detail & Related papers (2024-12-24T03:06:52Z)
Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs) In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL) SOUL aims to evaluate sentiment understanding through two subtasks: Review (RC) and Justification Generation (JG)
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
Evaluating the Deductive Competence of Large Language Models [0.2218292673050528]
We investigate whether several large language models (LLMs) can solve a classic type of deductive reasoning problem. We do find performance differences between conditions; however, they do not improve overall performance. We find that performance interacts with presentation format and content in unexpected ways that differ from human performance.
arXiv Detail & Related papers (2023-09-11T13:47:07Z)
Are Emergent Abilities in Large Language Models just In-Context Learning? [46.561464069450444]
We present a novel theory that explains emergent abilities, taking into account their potential confounding factors. Our findings suggest that purported emergent abilities are not truly emergent, but result from a combination of in-context learning, model memory, and linguistic knowledge.
arXiv Detail & Related papers (2023-09-04T20:54:11Z)
A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds? [2.7342737448775534]
Large Language Models (LLMs) have been linked to claims about human-like linguistic performance. We analyze the contribution of LLMs as theoretically informative representations of a target cognitive system. We evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing.
arXiv Detail & Related papers (2023-07-26T18:58:53Z)
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks [71.19560970717495]
Recent language models show impressive performance across a wide range of tasks. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? We propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks.
arXiv Detail & Related papers (2023-07-05T17:50:42Z)
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM) We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
Post Hoc Explanations of Language Models Can Improve Language Models [43.2109029463221]
We present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY) We leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. Our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks.
arXiv Detail & Related papers (2023-05-19T04:46:04Z)
Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models [5.975913042883176]
Recent work studies the cognitive capabilities of language models through psychological tests designed for humans. We formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks.
arXiv Detail & Related papers (2022-12-21T04:43:19Z)
oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.