BRAINTEASER: Lateral Thinking Puzzles for Large Language Models
- URL: http://arxiv.org/abs/2310.05057v3
- Date: Thu, 9 Nov 2023 19:45:13 GMT
- Title: BRAINTEASER: Lateral Thinking Puzzles for Large Language Models
- Authors: Yifan Jiang, Filip Ilievski, Kaixin Ma, Zhivar Sourati
- Abstract summary: BRAINTEASER is a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking.
Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance.
We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
- Score: 15.95314613982879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of language models has inspired the NLP community to attend to
tasks that require implicit and complex reasoning, relying on human-like
commonsense mechanisms. While such vertical thinking tasks have been relatively
popular, lateral thinking puzzles have received little attention. To bridge
this gap, we devise BRAINTEASER: a multiple-choice Question Answering task
designed to test the model's ability to exhibit lateral thinking and defy
default commonsense associations. We design a three-step procedure for creating
the first lateral thinking benchmark, consisting of data collection, distractor
generation, and generation of adversarial examples, leading to 1,100 puzzles
with high-quality annotations. To assess the consistency of lateral reasoning
by models, we enrich BRAINTEASER based on a semantic and contextual
reconstruction of its questions. Our experiments with state-of-the-art
instruction- and commonsense language models reveal a significant gap between
human and model performance, which is further widened when consistency across
adversarial formats is considered. We make all of our code and data available
to stimulate work on developing and evaluating lateral thinking models.
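As a rough illustration of the evaluation protocol described in the abstract, the sketch below scores a model on groups of BRAINTEASER-style puzzles, where each group pairs an original question with its semantic and contextual reconstructions and a group counts as consistent only if every variant is answered correctly. The example puzzle and the ask_model() stub are illustrative placeholders, not the paper's released data or evaluation code.
```python
# Minimal sketch of instance- and group-level scoring for a
# BRAINTEASER-style multiple-choice task. The example puzzle and the
# ask_model() stub are illustrative placeholders, not the paper's
# released data or evaluation code.

def ask_model(question: str, options: list[str]) -> int:
    """Placeholder: always picks the first option.
    Replace with a real language-model call."""
    return 0

# Each group pairs an original puzzle with its semantic and contextual
# reconstructions; all variants in a group must be answered correctly
# for the group to count as consistent.
groups = [
    {
        "variants": [
            {"question": "A man shaves several times a day, yet still "
                         "has a beard. How?",
             "options": ["He is a barber.", "His razor is blunt.",
                         "He shaves his arms.", "None of the above."],
             "answer": 0},
            # ... reconstructed variants of the same puzzle go here ...
        ],
    },
]

def evaluate(groups: list[dict]) -> dict:
    n_correct, n_total, n_groups_correct = 0, 0, 0
    for group in groups:
        group_ok = True
        for v in group["variants"]:
            right = ask_model(v["question"], v["options"]) == v["answer"]
            n_correct += right
            n_total += 1
            group_ok &= right
        n_groups_correct += group_ok
    return {"instance_accuracy": n_correct / n_total,
            "group_consistency": n_groups_correct / len(groups)}

print(evaluate(groups))
```
Reporting group-level consistency alongside plain accuracy is what exposes the gap the abstract mentions: a model can answer many individual puzzles correctly while failing to answer all reconstructions of the same puzzle consistently.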
Related papers
- COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes [14.603382370403]
We formulate visual lateral thinking as a multiple-choice question-answering task.
We describe a three-step taxonomy-driven methodology for instantiating task examples.
We develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles.
arXiv Detail & Related papers (2024-09-06T06:49:55Z)
- Conceptual and Unbiased Reasoning in Language Models [98.90677711523645]
We propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions.
We show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks.
We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making.
arXiv Detail & Related papers (2024-03-30T00:53:53Z)
- PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
- Large Language Models as Analogical Reasoners [155.9617224350088]
Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks.
We introduce a new prompting approach, analogical prompting, designed to automatically guide the reasoning process of large language models.
arXiv Detail & Related papers (2023-10-03T00:57:26Z)
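Analogical prompting, as summarized in the entry above, has the model self-generate relevant exemplars before solving the target problem, in place of hand-written few-shot demonstrations. Below is a minimal sketch of what such a prompt might look like; the template wording is an assumption for illustration, not the paper's exact instruction text.
```python
# Minimal sketch of an analogical-style prompt: the model is first asked
# to recall and solve a few relevant problems of its own choosing, then
# to solve the target problem. The template wording is illustrative, not
# the paper's exact prompt.

def build_analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    return (
        f"Problem: {problem}\n\n"
        f"Relevant problems: recall {n_exemplars} distinct problems that "
        "are relevant to the problem above. For each, state the problem "
        "and explain its solution.\n\n"
        "Solve the initial problem: use the insights from the recalled "
        "problems to solve the problem above, step by step."
    )

print(build_analogical_prompt(
    "An athlete runs 3 km east, then 4 km north. "
    "How far is she from her start?"))
```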
- Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting [0.2552922646705803]
The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays of third-year medical students.
arXiv Detail & Related papers (2023-09-30T06:25:27Z)
- MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation [62.44907105496227]
MindDial is a novel conversational framework that can generate situated free-form responses with theory-of-mind modeling.
We introduce an explicit mind module that can track the speaker's belief and the speaker's prediction of the listener's belief.
Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation.
arXiv Detail & Related papers (2023-06-27T07:24:32Z)
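As a rough illustration of the belief tracking described in the MindDial entry above, the sketch below keeps a first-order belief state (what the speaker believes) alongside a second-order one (what the speaker predicts the listener believes) and surfaces their mismatches as candidate content for the next turn. The class, field names, and gap heuristic are hypothetical, not MindDial's actual interface.
```python
from dataclasses import dataclass, field

# Hypothetical sketch of a mind module in the spirit of MindDial:
# the field names and the gap heuristic are assumptions, not the
# framework's actual API.

@dataclass
class MindState:
    # First-order belief: what the speaker currently believes.
    speaker_belief: dict = field(default_factory=dict)
    # Second-order belief: what the speaker predicts the listener believes.
    predicted_listener_belief: dict = field(default_factory=dict)

    def belief_gaps(self) -> dict:
        """Facts the speaker holds that it predicts the listener lacks or
        holds differently; candidates to ground in the next response."""
        return {k: v for k, v in self.speaker_belief.items()
                if self.predicted_listener_belief.get(k) != v}

state = MindState(
    speaker_belief={"meeting_time": "3pm", "room": "B12"},
    predicted_listener_belief={"meeting_time": "3pm", "room": "A4"},
)
print(state.belief_gaps())  # {'room': 'B12'} -> mention the room next
```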
- Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango [11.344587937052697]
This work takes preliminary steps toward a deeper understanding of reasoning mechanisms in large language models.
Our work centers around querying the model while controlling for all but one of the components in a prompt: symbols, patterns, and text.
We posit that text imbues patterns with commonsense knowledge and meaning.
arXiv Detail & Related papers (2022-09-16T02:54:00Z)
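To make the controlled-query setup from the entry above concrete, the sketch below builds counterfactual variants of a single chain-of-thought exemplar, ablating symbols, text, or the pattern in turn while holding the other two components fixed. The base exemplar and the substitution scheme are illustrative assumptions, not the paper's actual prompts.
```python
# Illustrative counterfactual variants of one chain-of-thought exemplar,
# each ablating one of {symbols, text, pattern}. The base exemplar and
# the substitutions are assumptions for demonstration.

base = "John has 3 apples and buys 2 more, so John has 3 + 2 = 5 apples."

variants = {
    # Symbols ablated: concrete numbers become abstract placeholders,
    # while the text frame and the equation pattern are kept.
    "no symbols": "John has X apples and buys Y more, "
                  "so John has X + Y = Z apples.",
    # Text ablated: only the symbolic pattern remains.
    "no text": "3 + 2 = 5",
    # Pattern ablated: the equation structure is dropped, keeping the
    # symbols embedded in ordinary text.
    "no pattern": "John has 3 apples and buys 2 more apples.",
}

for name, prompt in {"full": base, **variants}.items():
    print(f"{name:>10}: {prompt}")
```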
- elBERto: Self-supervised Commonsense Learning for Question Answering [131.51059870970616]
We propose a Self-supervised Bidirectional Representation Learning of Commonsense framework, which is compatible with off-the-shelf QA model architectures.
The framework comprises five self-supervised tasks to force the model to fully exploit the additional training signals from contexts containing rich commonsense.
elBERto achieves substantial improvements on out-of-paragraph and no-effect questions where simple lexical similarity comparison does not help.
arXiv Detail & Related papers (2022-03-17T16:23:45Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
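For contrast with the entry above, the sketch below places a direct prompt next to a few-shot chain-of-thought prompt for the same question; the worked exemplar with intermediate steps is what elicits step-by-step reasoning from sufficiently large models. The exemplar wording is illustrative, not the paper's exact prompt text.
```python
# Direct prompting versus few-shot chain-of-thought prompting for the
# same question. The exemplar wording is illustrative, not the paper's
# exact prompt text.

QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")

direct_prompt = f"Q: {QUESTION}\nA:"

cot_prompt = (
    # Worked exemplar whose answer spells out the intermediate steps.
    "Q: A juggler has 16 balls, and half of them are golf balls. "
    "How many golf balls are there?\n"
    "A: There are 16 balls in total. Half of 16 is 8. "
    "So there are 8 golf balls. The answer is 8.\n\n"
    f"Q: {QUESTION}\nA:"
)

print(cot_prompt)
```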
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.