BRAINTEASER: Lateral Thinking Puzzles for Large Language Models
- URL: http://arxiv.org/abs/2310.05057v3
- Date: Thu, 9 Nov 2023 19:45:13 GMT
- Title: BRAINTEASER: Lateral Thinking Puzzles for Large Language Models
- Authors: Yifan Jiang, Filip Ilievski, Kaixin Ma, Zhivar Sourati
- Abstract summary: BRAINTEASER is a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking.
Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance.
We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
- Score: 15.95314613982879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of language models has inspired the NLP community to attend to
tasks that require implicit and complex reasoning, relying on human-like
commonsense mechanisms. While such vertical thinking tasks have been relatively
popular, lateral thinking puzzles have received little attention. To bridge
this gap, we devise BRAINTEASER: a multiple-choice Question Answering task
designed to test the model's ability to exhibit lateral thinking and defy
default commonsense associations. We design a three-step procedure for creating
the first lateral thinking benchmark, consisting of data collection, distractor
generation, and generation of adversarial examples, leading to 1,100 puzzles
with high-quality annotations. To assess the consistency of lateral reasoning
by models, we enrich BRAINTEASER based on a semantic and contextual
reconstruction of its questions. Our experiments with state-of-the-art
instruction- and commonsense language models reveal a significant gap between
human and model performance, which is further widened when consistency across
adversarial formats is considered. We make all of our code and data available
to stimulate work on developing and evaluating lateral thinking models.
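As a rough illustration of the evaluation protocol described in the abstract, the sketch below scores a model on groups of BRAINTEASER-style puzzles, where each group pairs an original question with its semantic and contextual reconstructions and a group counts as consistent only if every variant is answered correctly. The example puzzle and the ask_model() stub are illustrative placeholders, not the paper's released data or evaluation code.
```python
# Minimal sketch of instance- and group-level scoring for a
# BRAINTEASER-style multiple-choice task. The example puzzle and the
# ask_model() stub are illustrative placeholders, not the paper's
# released data or evaluation code.

def ask_model(question: str, options: list[str]) -> int:
    """Placeholder: always picks the first option.
    Replace with a real language-model call."""
    return 0

# Each group pairs an original puzzle with its semantic and contextual
# reconstructions; all variants in a group must be answered correctly
# for the group to count as consistent.
groups = [
    {
        "variants": [
            {"question": "A man shaves several times a day, yet still "
                         "has a beard. How?",
             "options": ["He is a barber.", "His razor is blunt.",
                         "He shaves his arms.", "None of the above."],
             "answer": 0},
            # ... reconstructed variants of the same puzzle go here ...
        ],
    },
]

def evaluate(groups: list[dict]) -> dict:
    n_correct, n_total, n_groups_correct = 0, 0, 0
    for group in groups:
        group_ok = True
        for v in group["variants"]:
            right = ask_model(v["question"], v["options"]) == v["answer"]
            n_correct += right
            n_total += 1
            group_ok &= right
        n_groups_correct += group_ok
    return {"instance_accuracy": n_correct / n_total,
            "group_consistency": n_groups_correct / len(groups)}

print(evaluate(groups))
```
Reporting group-level consistency alongside plain accuracy is what exposes the gap the abstract mentions: a model can answer many individual puzzles correctly while failing to answer all reconstructions of the same puzzle consistently.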
Related papers
- COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes [14.603382370403]
We formulate visual lateral thinking as a multiple-choice question-answering task.
We describe a three-step taxonomy-driven methodology for instantiating task examples.
We develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles.
arXiv Detail & Related papers (2024-09-06T06:49:55Z)
- Conceptual and Unbiased Reasoning in Language Models [98.90677711523645]
We propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions.
We show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks.
We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making.
arXiv Detail & Related papers (2024-03-30T00:53:53Z)
- PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
- Large Language Models as Analogical Reasoners [155.9617224350088]
Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks.
We introduce a new prompting approach, analogical prompting, designed to automatically guide the reasoning process of large language models.
arXiv Detail & Related papers (2023-10-03T00:57:26Z)
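Analogical prompting, as summarized in the entry above, has the model self-generate relevant exemplars before solving the target problem, in place of hand-written few-shot demonstrations. Below is a minimal sketch of what such a prompt might look like; the template wording is an assumption for illustration, not the paper's exact instruction text.
```python
# Minimal sketch of an analogical-style prompt: the model is first asked
# to recall and solve a few relevant problems of its own choosing, then
# to solve the target problem. The template wording is illustrative, not
# the paper's exact prompt.

def build_analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    return (
        f"Problem: {problem}\n\n"
        f"Relevant problems: recall {n_exemplars} distinct problems that "
        "are relevant to the problem above. For each, state the problem "
        "and explain its solution.\n\n"
        "Solve the initial problem: use the insights from the recalled "
        "problems to solve the problem above, step by step."
    )

print(build_analogical_prompt(
    "An athlete runs 3 km east, then 4 km north. "
    "How far is she from her start?"))
```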
- Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting [0.2552922646705803]
The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays of third-year medical students.
arXiv Detail & Related papers (2023-09-30T06:25:27Z)
- MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation [62.44907105496227]
MindDial is a novel conversational framework that can generate situated free-form responses with theory-of-mind modeling.
We introduce an explicit mind module that can track the speaker's belief and the speaker's prediction of the listener's belief.
Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation.
arXiv Detail & Related papers (2023-06-27T07:24:32Z)
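As a rough illustration of the belief tracking described in the MindDial entry above, the sketch below keeps a first-order belief state (what the speaker believes) alongside a second-order one (what the speaker predicts the listener believes) and surfaces their mismatches as candidate content for the next turn. The class, field names, and gap heuristic are hypothetical, not MindDial's actual interface.
```python
from dataclasses import dataclass, field

# Hypothetical sketch of a mind module in the spirit of MindDial:
# the field names and the gap heuristic are assumptions, not the
# framework's actual API.

@dataclass
class MindState:
    # First-order belief: what the speaker currently believes.
    speaker_belief: dict = field(default_factory=dict)
    # Second-order belief: what the speaker predicts the listener believes.
    predicted_listener_belief: dict = field(default_factory=dict)

    def belief_gaps(self) -> dict:
        """Facts the speaker holds that it predicts the listener lacks or
        holds differently; candidates to ground in the next response."""
        return {k: v for k, v in self.speaker_belief.items()
                if self.predicted_listener_belief.get(k) != v}

state = MindState(
    speaker_belief={"meeting_time": "3pm", "room": "B12"},
    predicted_listener_belief={"meeting_time": "3pm", "room": "A4"},
)
print(state.belief_gaps())  # {'room': 'B12'} -> mention the room next
```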
- Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango [11.344587937052697]
This work takes preliminary steps toward a deeper understanding of reasoning mechanisms in large language models.
Our work centers around querying the model while controlling for all but one of the components in a prompt: symbols, patterns, and text.
We posit that text imbues patterns with commonsense knowledge and meaning.
arXiv Detail & Related papers (2022-09-16T02:54:00Z)
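To make the controlled-query setup from the entry above concrete, the sketch below builds counterfactual variants of a single chain-of-thought exemplar, ablating symbols, text, or the pattern in turn while holding the other two components fixed. The base exemplar and the substitution scheme are illustrative assumptions, not the paper's actual prompts.
```python
# Illustrative counterfactual variants of one chain-of-thought exemplar,
# each ablating one of {symbols, text, pattern}. The base exemplar and
# the substitutions are assumptions for demonstration.

base = "John has 3 apples and buys 2 more, so John has 3 + 2 = 5 apples."

variants = {
    # Symbols ablated: concrete numbers become abstract placeholders,
    # while the text frame and the equation pattern are kept.
    "no symbols": "John has X apples and buys Y more, "
                  "so John has X + Y = Z apples.",
    # Text ablated: only the symbolic pattern remains.
    "no text": "3 + 2 = 5",
    # Pattern ablated: the equation structure is dropped, keeping the
    # symbols embedded in ordinary text.
    "no pattern": "John has 3 apples and buys 2 more apples.",
}

for name, prompt in {"full": base, **variants}.items():
    print(f"{name:>10}: {prompt}")
```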
- elBERto: Self-supervised Commonsense Learning for Question Answering [131.51059870970616]
We propose a Self-supervised Bidirectional Representation Learning of Commonsense framework, which is compatible with off-the-shelf QA model architectures.
The framework comprises five self-supervised tasks to force the model to fully exploit the additional training signals from contexts containing rich commonsense.
elBERto achieves substantial improvements on out-of-paragraph and no-effect questions where simple lexical similarity comparison does not help.
arXiv Detail & Related papers (2022-03-17T16:23:45Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
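For contrast with the entry above, the sketch below places a direct prompt next to a few-shot chain-of-thought prompt for the same question; the worked exemplar with intermediate steps is what elicits step-by-step reasoning from sufficiently large models. The exemplar wording is illustrative, not the paper's exact prompt text.
```python
# Direct prompting versus few-shot chain-of-thought prompting for the
# same question. The exemplar wording is illustrative, not the paper's
# exact prompt text.

QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")

direct_prompt = f"Q: {QUESTION}\nA:"

cot_prompt = (
    # Worked exemplar whose answer spells out the intermediate steps.
    "Q: A juggler has 16 balls, and half of them are golf balls. "
    "How many golf balls are there?\n"
    "A: There are 16 balls in total. Half of 16 is 8. "
    "So there are 8 golf balls. The answer is 8.\n\n"
    f"Q: {QUESTION}\nA:"
)

print(cot_prompt)
```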
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.