Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
- URL: http://arxiv.org/abs/2406.02061v4
- Date: Sat, 13 Jul 2024 21:02:21 GMT
- Title: Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
- Authors: Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev
- Abstract summary: We demonstrate a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales.
The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving.
We take these initial observations to stimulate urgent re-assessment of the claimed capabilities of the current generation of Large Language Models.
- Score: 13.532180752491954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common-sense problem (AIW problem) formulated in concise natural language and easily solvable by humans. The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving, while also expressing strong overconfidence in the wrong solutions, often backed up by plausible-sounding explanation-like confabulations. Various standard interventions attempting to obtain the right solution, such as various types of enhanced prompting or urging the models to reconsider the wrong solutions through multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing the experiments in the paper and raw experiment data can be found at https://github.com/LAION-AI/AIW
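To make the setup concrete, below is a minimal Python sketch of the kind of variation-robustness probe the abstract describes: instantiate the AIW problem with different numbers of brothers and sisters and check answers against the ground truth (each brother has Alice's M sisters plus Alice herself, i.e. M + 1). The prompt wording and the `ask_model` helper are assumptions for illustration; the authors' actual code and data live in the linked repository.

```python
import re
import random

def make_aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    return (
        f"Alice has {n_brothers} brothers and she also has {m_sisters} sisters. "
        "How many sisters does Alice's brother have? Answer with a single number."
    )

def ask_model(prompt: str) -> str:
    # Hypothetical hook: replace with a call to whatever chat-model API is under test.
    raise NotImplementedError

def first_int(text: str):
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None

def aiw_accuracy(num_trials: int = 20) -> float:
    correct = 0
    for _ in range(num_trials):
        n, m = random.randint(1, 6), random.randint(1, 6)
        answer = first_int(ask_model(make_aiw_prompt(n, m)))
        # Ground truth: each of Alice's brothers has her M sisters plus Alice herself.
        correct += (answer == m + 1)
    return correct / num_trials
```

Running this over many randomly drawn (N, M) pairs exposes the fluctuation the paper reports: accuracy swings across variations that should not affect the solution.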
Related papers
- Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence [3.4049215220521933]
We introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models.
The framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts.
The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats.
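As a rough illustration of what a mutable-parameter challenge template looks like, here is a small Python sketch. The template, parameter ranges, and the simple reliability measure are assumptions for illustration, not DIA-Bench's actual templates or its four metrics.

```python
import random

# Hypothetical challenge template with mutable parameters; a fresh instance is
# drawn for every attempt, so memorizing one instance does not help.
TEMPLATE = "Compute the sum of all integers from {lo} to {hi} inclusive."

def instantiate():
    lo = random.randint(1, 50)
    hi = lo + random.randint(10, 100)
    ground_truth = (lo + hi) * (hi - lo + 1) // 2  # closed-form answer
    return TEMPLATE.format(lo=lo, hi=hi), ground_truth

def reliability(solve, attempts: int = 10) -> float:
    """Fraction of freshly instantiated challenges that `solve` answers correctly."""
    hits = 0
    for _ in range(attempts):
        prompt, answer = instantiate()
        hits += (solve(prompt) == answer)
    return hits / attempts
```

Any wrapper that sends the prompt to a model and parses an integer reply can be passed in as `solve`.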
arXiv Detail & Related papers (2024-10-20T20:07:36Z) - Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z) - E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models [29.763745375790933]
Large language models (LLMs) are sensitive to prompts: a synonymous rewording or a typo may lead to unexpected results from the model.
We evaluate the ease of use of LLMs and construct E-Bench, which simulates realistic conditions of human use.
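A minimal Python sketch of the kind of prompt-perturbation check this sensitivity suggests, assuming a hypothetical `ask_model` callable; E-Bench's actual perturbations and scoring are not reproduced here.

```python
import random

def inject_typo(prompt: str) -> str:
    """Swap two adjacent characters at a random position (a crude typo model)."""
    if len(prompt) < 3:
        return prompt
    i = random.randrange(len(prompt) - 1)
    chars = list(prompt)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def consistency(ask_model, prompt: str, num_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the unperturbed answer."""
    reference = ask_model(prompt)
    matches = sum(ask_model(inject_typo(prompt)) == reference for _ in range(num_variants))
    return matches / num_variants
```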
arXiv Detail & Related papers (2024-06-16T14:08:30Z) - OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models able to use the original formulation.
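For context, the sketch below contrasts the two formulations in generic form; the `loglikelihood` and `generate` callables are hypothetical stand-ins, and this is not OLMES's exact specification.

```python
def cloze_choice(loglikelihood, question: str, options: list) -> str:
    """'Cloze' formulation: score each answer text as a continuation of the question
    and pick the option with the highest length-normalized log-likelihood."""
    scores = [loglikelihood(question + " " + opt) / max(len(opt), 1) for opt in options]
    return options[scores.index(max(scores))]

def mcqa_choice(generate, question: str, options: list) -> str:
    """Original formulation: present lettered options and ask the model for a letter
    (assumes at most four options for this toy example)."""
    letters = "ABCD"
    prompt = "\n".join([question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]) + "\nAnswer:"
    reply = generate(prompt).strip()
    return options[letters.index(reply[0])] if reply and reply[0] in letters else options[0]
```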
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
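As a toy illustration of fitting such a sigmoidal trend, the snippet below fits a generic logistic curve to synthetic placeholder points; the capability scores and accuracies are invented for illustration, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, lower, upper, slope, midpoint):
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - midpoint)))

# Synthetic placeholder points: an aggregate capability score per model vs. accuracy
# on a downstream task (the paper builds such curves from publicly available models).
capability = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
accuracy = np.array([0.05, 0.07, 0.12, 0.30, 0.55, 0.72, 0.80])

params, _ = curve_fit(sigmoid, capability, accuracy, p0=[0.0, 1.0, 2.0, 2.0], maxfev=10000)
print("extrapolated accuracy at capability 4.0:", sigmoid(4.0, *params))
```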
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z) - Scalable Language Model with Generalized Continual Learning [58.700439919096155]
Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
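A schematic PyTorch sketch of the per-timestep early-exit idea behind such adaptive compute, with hypothetical layer and classifier modules; CALM's actual confidence measures and calibration guarantees are not modeled here.

```python
import torch
import torch.nn as nn

def early_exit_step(hidden, layers, classifier, threshold: float = 0.9):
    """Run one decoding step, exiting the layer stack early once the intermediate
    prediction is confident enough. Returns (token id, number of layers used)."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = torch.softmax(classifier(hidden), dim=-1)
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:
            return token.item(), depth
    return token.item(), len(layers)

# Toy usage with random weights: confidence rarely clears the threshold, so the full
# stack is typically used; a calibrated model would exit earlier on easy tokens.
layers = [nn.Linear(16, 16) for _ in range(8)]
classifier = nn.Linear(16, 100)
print(early_exit_step(torch.randn(16), layers, classifier))
```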
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
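A simplified single-layer PyTorch sketch of the noise-stability idea: perturb a hidden representation with standard Gaussian noise and penalize output drift on top of the task loss. This is illustrative only and not the paper's exact LNSR objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 64), nn.Tanh())  # stand-in for a pre-trained layer
head = nn.Linear(64, 2)                                 # task head being fine-tuned

def noise_stability_loss(x, y, noise_std: float = 1.0, reg_weight: float = 0.1):
    hidden = encoder(x)
    clean_logits = head(hidden)
    # Inject standard Gaussian noise into the hidden representation and penalize
    # how much the output moves, in addition to the usual task loss.
    noisy_logits = head(hidden + noise_std * torch.randn_like(hidden))
    task_loss = F.cross_entropy(clean_logits, y)
    stability_penalty = (clean_logits - noisy_logits).pow(2).mean()
    return task_loss + reg_weight * stability_penalty

x = torch.randn(8, 32)
y = torch.randint(0, 2, (8,))
print(noise_stability_loss(x, y))
```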
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - TsmoBN: Interventional Generalization for Unseen Clients in Federated Learning [23.519212374186232]
We form a training structural causal model (SCM) to explain the challenges of model generalization in a distributed learning paradigm.
We present a simple yet effective method using test-specific and momentum tracked batch normalization (TsmoBN) to generalize FL models to testing clients.
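A minimal Python sketch of test-specific, momentum-tracked normalization over simple 2-D feature batches; this is a stand-in for the idea, not the paper's full TsmoBN procedure.

```python
import torch

class TestTimeMomentumNorm:
    """Normalize test features with statistics tracked over the test client's own
    batches, rather than statistics frozen from training."""

    def __init__(self, num_features: int, momentum: float = 0.1, eps: float = 1e-5):
        self.mean = torch.zeros(num_features)
        self.var = torch.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0, unbiased=False)
        # Momentum-tracked, test-specific statistics.
        self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
        self.var = (1 - self.momentum) * self.var + self.momentum * batch_var
        return (x - self.mean) / torch.sqrt(self.var + self.eps)

norm = TestTimeMomentumNorm(num_features=4)
print(norm(torch.randn(16, 4)).shape)  # torch.Size([16, 4])
```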
arXiv Detail & Related papers (2021-10-19T13:46:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.