WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large
Language Models
- URL: http://arxiv.org/abs/2311.15930v1
- Date: Mon, 27 Nov 2023 15:38:17 GMT
- Title: WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large
Language Models
- Authors: Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya,
Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke
Hupkes, Pascal Vincent
- Abstract summary: We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat).
We show that these models make errors even with as few as three objects.
Errors persist even with chain-of-thought prompting and in-context learning.
- Score: 35.088946378980914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose WorldSense, a benchmark designed to assess the extent to which
LLMs are consistently able to sustain tacit world models, by testing how they
draw simple inferences from descriptions of simple arrangements of entities.
WorldSense is a synthetic benchmark with three problem types, each with its own
trivial control. It explicitly avoids bias by decorrelating the abstract
structure of problems from the vocabulary and expressions, and by decorrelating
all problem subparts from the correct response. We run our benchmark on three
state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these
models make errors even with as few as three objects. Furthermore, they have
quite heavy response biases, preferring certain responses irrespective of the
question. Errors persist even with chain-of-thought prompting and in-context
learning. Lastly, we show that while finetuning on similar problems does result
in substantial improvements, both within- and out-of-distribution, the finetuned
models do not generalise beyond a constrained problem space.
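As a rough illustration of the benchmark's two design ideas (sampling surface vocabulary independently of the abstract problem structure, and balancing answers so that no subpart of a problem predicts the correct response), the sketch below generates toy "infer the order" items and measures a simple response bias. This is a minimal sketch under stated assumptions, not the authors' released generator; the entity pools, question template, and bias measure are illustrative.

```python
# Minimal sketch of decorrelated problem generation and response-bias measurement.
# NOT the WorldSense generator; entity pools and templates are hypothetical.
import random
from collections import Counter

ENTITY_POOLS = [
    ["a red book", "a blue mug", "a green vase", "a black pen"],
    ["Alice", "Bob", "Carol", "Dave"],
]

def make_problem(rng: random.Random, n: int = 3):
    """Build one toy 'infer the order' item with a TRUE/FALSE query."""
    # Abstract structure: a random left-to-right order over n slots.
    order = list(range(n))
    rng.shuffle(order)
    # Surface vocabulary is drawn independently of that structure (decorrelation idea).
    pool = rng.choice(ENTITY_POOLS)
    names = rng.sample(pool, n)
    entities = [names[i] for i in order]
    # State only adjacent relations, so the query below needs a transitive inference.
    facts = [f"{entities[i]} is left of {entities[i + 1]}." for i in range(n - 1)]
    # Query the two endpoints, flipping the phrasing half the time so TRUE/FALSE
    # answers are balanced by construction and no subpart predicts the answer.
    a, b = entities[0], entities[-1]
    answer = rng.random() < 0.5
    question = f"Is {a} left of {b}?" if answer else f"Is {b} left of {a}?"
    return {"facts": " ".join(facts), "question": question, "answer": answer}

def response_bias(predictions):
    """How far the most frequent TRUE/FALSE response exceeds a balanced 50/50 split."""
    counts = Counter(predictions)
    return max(counts.values()) / len(predictions) - 0.5

if __name__ == "__main__":
    rng = random.Random(0)
    for item in (make_problem(rng) for _ in range(3)):
        print(item["facts"], item["question"], item["answer"])
    # With a real chat-LLM, `predictions` would be its parsed TRUE/FALSE outputs.
    print("bias of an always-TRUE responder:", response_bias([True] * 8))
```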
Related papers
- Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency via a relative consistency score (a minimal illustration follows the list below).
arXiv Detail & Related papers (2024-06-18T17:25:47Z) - Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
We demonstrate a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales.
The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving.
We take these initial observations to stimulate urgent re-assessment of the claimed capabilities of current generation of Large Language Models.
arXiv Detail & Related papers (2024-06-04T07:43:33Z) - Frontier Language Models are not Robust to Adversarial Arithmetic, or
"What do I need to say so you agree 2+2=5?" [88.59136033348378]
We study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
The problem consists of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete.
We show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops.
arXiv Detail & Related papers (2023-11-08T19:07:10Z) - Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics
and Prompt Wording [0.0]
We analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response.
We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies.
The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
arXiv Detail & Related papers (2023-06-09T19:07:31Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - On Reality and the Limits of Language Data: Aligning LLMs with Human
Norms [10.02997544238235]
Large Language Models (LLMs) harness linguistic associations in vast natural language data for practical applications.
We explore the limits of what can be learned from language data alone using a novel and tightly controlled reasoning test (ART) and compare human norms against versions of GPT-3.
Our findings highlight the categories of common-sense relations that models could learn directly from data, as well as areas of weakness.
arXiv Detail & Related papers (2022-08-25T10:21:23Z) - Generalization of Neural Combinatorial Solvers Through the Lens of
Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
arXiv Detail & Related papers (2021-10-21T07:28:11Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
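For the ConsisEval-style easy/hard question pairing mentioned earlier in this list, the sketch below shows one way a consistency score and a relative variant could be computed. The exact definitions used in that paper are an assumption here, so treat the formulas as illustrative only.

```python
# Minimal illustration of a consistency check over (easy, hard) question pairs.
# Assumed definitions, not necessarily ConsisEval's: consistency is the probability
# of solving the easy question given the paired hard one was solved; the relative
# score compares that to the model's unconditional easy-question accuracy.
def consistency(pairs):
    """pairs: list of (easy_correct, hard_correct) booleans, one per benchmark entry."""
    solved_hard = [easy for easy, hard in pairs if hard]
    if not solved_hard:
        return None
    return sum(solved_hard) / len(solved_hard)

def relative_consistency(pairs):
    """Consistency relative to the model's overall easy-question accuracy."""
    easy_acc = sum(easy for easy, _ in pairs) / len(pairs)
    cons = consistency(pairs)
    return None if cons is None or easy_acc == 0 else cons / easy_acc

if __name__ == "__main__":
    # Hypothetical results: the model solves some hard questions while missing easy ones.
    results = [(True, True), (False, True), (True, False), (True, True), (False, False)]
    print("consistency:", consistency(results))                     # 2/3
    print("relative consistency:", relative_consistency(results))   # (2/3) / (3/5)
```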
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.