Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- URL: http://arxiv.org/abs/2307.02477v3
- Date: Thu, 28 Mar 2024 23:37:24 GMT
- Title: Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Authors: Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim
- Abstract summary: Recent language models show impressive performance across a wide range of tasks.
Are these skills general and transferable, or specialized to specific tasks seen during pretraining?
We propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks.
- Score: 71.19560970717495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
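To make the counterfactual setup concrete, the minimal sketch below poses the same two-digit addition task under the default assumption (base 10) and under a counterfactual one (base 9, one of the bases used in the paper's arithmetic task). The `query_model` function is a hypothetical stand-in for whatever LM interface is under evaluation, and the prompt wording is illustrative rather than the paper's.

```python
# Minimal sketch of a default vs. counterfactual arithmetic evaluation.
# `query_model` is a hypothetical stand-in for the LM being evaluated.
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (base <= 10)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_prompt(a: int, b: int, base: int) -> str:
    return (
        f"You are a mathematician. All numbers are in base {base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}?"
    )

def evaluate(query_model, base: int, n_examples: int = 100) -> float:
    """Accuracy on random two-digit additions carried out in `base`."""
    correct = 0
    for _ in range(n_examples):
        a, b = random.randint(10, 80), random.randint(10, 80)
        answer = query_model(make_prompt(a, b, base))
        if answer.strip() == to_base(a + b, base):
            correct += 1
    return correct / n_examples

# The paper's diagnostic is the gap between the two conditions:
# default_acc        = evaluate(query_model, base=10)
# counterfactual_acc = evaluate(query_model, base=9)
```

A model that had truly abstracted the addition procedure would show little gap between the two conditions; the paper's finding is that counterfactual accuracy is nontrivial but consistently and substantially lower than the default.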
Related papers
- Scalable Language Model with Generalized Continual Learning [58.700439919096155]
Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
- Auxiliary task demands mask the capabilities of smaller language models [2.938889003635811]
We show that evaluation methods with greater task demands yield lower performance than evaluations with reduced demands.
Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence.
arXiv Detail & Related papers (2024-04-03T02:56:52Z)
- Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks [101.40633115037983]
Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions.
How to select new tasks to improve the performance and generalizability of IT models remains an open question.
We propose active instruction tuning based on prompt uncertainty, a novel framework to identify informative tasks, and then actively tune the models on the selected tasks.
arXiv Detail & Related papers (2023-11-01T04:40:05Z)
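For intuition about the prompt-uncertainty criterion in the entry above, here is a toy sketch (not the paper's implementation): a task's uncertainty is estimated by perturbing its instruction and counting how often the model's output changes, and the most uncertain tasks are then selected for tuning. `query_model` and the word-dropping perturbation are hypothetical stand-ins.

```python
# Toy sketch of task selection by prompt uncertainty: perturb a task's
# instruction and measure how often the model's output flips.
# `query_model(instruction, x)` is a hypothetical stand-in for the LLM.
import random

def perturb(instruction: str) -> str:
    """Crude perturbation: randomly drop one word from the instruction."""
    words = instruction.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def prompt_uncertainty(query_model, instruction, inputs, n_perturb=5):
    """Fraction of (perturbation, input) pairs whose output changes."""
    base = [query_model(instruction, x) for x in inputs]
    flips, total = 0, 0
    for _ in range(n_perturb):
        noisy = perturb(instruction)
        for x, y in zip(inputs, base):
            total += 1
            flips += int(query_model(noisy, x) != y)
    return flips / total

# Actively pick the most uncertain (most informative) tasks to tune on:
# ranked = sorted(tasks, reverse=True,
#                 key=lambda t: prompt_uncertainty(model, t.instruction, t.inputs))
```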
- Large Language Models Are Not Strong Abstract Reasoners [12.354660792999269]
Large Language Models have shown tremendous performance on a variety of natural language processing tasks.
It is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed.
We introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks.
arXiv Detail & Related papers (2023-05-31T04:50:29Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four bias-related tasks: diagnosis, identification, extraction, and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- Probing as Quantifying the Inductive Bias of Pre-trained Representations [99.93552997506438]
We present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task.
We apply our framework to a series of token-, arc-, and sentence-level tasks.
arXiv Detail & Related papers (2021-10-15T22:01:16Z)
- Supervised Bayesian Specification Inference from Demonstrations [11.855400596862275]
We present a probabilistic model for inferring task specification as a temporal logic formula.
We demonstrate the efficacy of our model for inferring specifications, with over 90% similarity observed between the inferred specification and the ground truth.
arXiv Detail & Related papers (2021-07-06T21:16:37Z)
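For a flavor of the specification-inference setup in the last entry, here is a toy Bayesian sketch; the candidate specifications, demonstration traces, and noise parameter are invented for illustration and are far simpler than the paper's temporal logic formulation.

```python
# Toy Bayesian specification inference: score candidate temporal
# specifications against demonstration traces. The candidates and the
# likelihood model are illustrative, not the paper's actual formulation.

# Each demonstration is a sequence of observed propositions per timestep.
demos = [
    [{"moving"}, {"moving"}, {"moving", "goal"}],
    [{"moving"}, {"moving", "goal"}],
]

# Candidate specifications as (name, checker-over-a-trace) pairs.
candidates = {
    "eventually goal": lambda trace: any("goal" in step for step in trace),
    "always moving":   lambda trace: all("moving" in step for step in trace),
    "never goal":      lambda trace: all("goal" not in step for step in trace),
}

def posterior(candidates, demos, eps: float = 0.05):
    """Posterior over specs under a uniform prior; `eps` is the assumed
    probability that a demonstration violates the true specification."""
    prior = 1.0 / len(candidates)
    scores = {}
    for name, holds in candidates.items():
        likelihood = 1.0
        for trace in demos:
            likelihood *= (1.0 - eps) if holds(trace) else eps
        scores[name] = prior * likelihood
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

print(posterior(candidates, demos))
# "eventually goal" and "always moving" are consistent with both demos
# and tie for the highest posterior; "never goal" is heavily penalized.
```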
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences arising from its use.