Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
- URL: http://arxiv.org/abs/2505.10844v2
- Date: Fri, 30 May 2025 03:59:03 GMT
- Title: Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
- Authors: Simeng Han, Stephen Xia, Grant Zhang, Howard Dai, Chen Liu, Lichang Chen, Hoang Huy Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy
- Abstract summary: We introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force.
- Score: 28.791905315055974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
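The layered evaluation described in the abstract lends itself to a simple harness. The sketch below is a minimal illustration of stages (1), (2), and the creativity grading, not the authors' actual code; `call_llm` is a hypothetical stub for any chat-model client, and the prompts paraphrase the stages named in the abstract.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; swap in any real chat-model client."""
    return "INSIGHT: pair the terms so each pair sums to the same value."

def parse_to_formal(brainteaser: str) -> str:
    # Stage (1): semantic parsing into a competition-style statement.
    return call_llm(
        "Rewrite this narrative brainteaser as a precise, "
        f"math-competition-style problem:\n\n{brainteaser}"
    )

def solve(formal_problem: str) -> str:
    # Stage (2): generate a solution from the mathematical form.
    return call_llm(f"Solve step by step:\n\n{formal_problem}")

def label_strategy(solution: str) -> str:
    # Beyond accuracy: judge whether the solution rests on a short
    # creative insight or on brute-force enumeration.
    verdict = call_llm(
        "Does this solution rely on a creative insight or on brute-force "
        f"enumeration? Answer INSIGHT or BRUTE_FORCE.\n\n{solution}"
    )
    return "insight" if "INSIGHT" in verdict.upper() else "brute_force"

problem = parse_to_formal("A farmer must cross a river with a wolf, a goat...")
print(label_strategy(solve(problem)))  # -> "insight" with the canned stub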
Related papers
- Computational Thinking Reasoning in Large Language Models [69.28428524878885]
Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs). Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability.
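A minimal sketch of "thinking by computing" (not CTM's actual implementation): execute any python-fenced block the model emits and feed the output back into the reasoning loop.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # build the fence programmatically to keep this listing clean

def run_code_blocks(model_output: str) -> str:
    # Find fenced python blocks in the model's output and execute each one,
    # capturing stdout so the results can be appended to the conversation.
    pattern = FENCE + r"python\n(.*?)" + FENCE
    results = []
    for block in re.findall(pattern, model_output, re.DOTALL):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(block, {})  # a real system would sandbox this
        results.append(buf.getvalue().strip())
    return "\n".join(results)

demo = "Let me compute:\n" + FENCE + "python\nprint(sum(range(101)))\n" + FENCE
print(run_code_blocks(demo))  # -> 5050
```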
arXiv Detail & Related papers (2025-06-03T09:11:15Z)
- ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges [72.19809898215857]
We introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. We also present ModelingAgent, a multi-agent framework that coordinates tool use and produces structured, well-grounded, creative solutions.
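An illustrative pipeline in the spirit of the abstract's three hand-offs (formulate, solve with tools, report); the role names and prompts are assumptions, not the paper's actual agent design.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub for any chat-model client."""
    return "(model response)"

def modeling_pipeline(problem: str) -> str:
    # 1) translate natural language into a formal mathematical formulation
    formulation = call_llm(f"Translate into a formal mathematical model:\n{problem}")
    # 2) apply appropriate tools to the formulation
    analysis = call_llm(f"Choose tools and solve:\n{formulation}")
    # 3) produce a structured, defensible report
    return call_llm(f"Write a structured report:\n{formulation}\n{analysis}")
```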
arXiv Detail & Related papers (2025-05-21T03:33:23Z)
- Thinkless: LLM Learns When to Think [57.857534644932194]
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning. On several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%.
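A sketch of adaptive short- vs. long-form reasoning in this spirit: the model first emits a control token choosing a mode. The token names and the routing prompt are illustrative assumptions, not Thinkless's learned policy.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; a real router is learned, not prompted."""
    return "<short> 42"

def answer(question: str) -> str:
    out = call_llm(
        "Decide: emit <short> for a direct answer or <think> for "
        f"step-by-step reasoning, then respond.\n{question}"
    )
    if out.startswith("<think>"):
        return out  # long chain-of-thought path
    return out.removeprefix("<short>").strip()  # concise path

print(answer("What is 6 * 7?"))  # -> "42" with the canned stub
```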
arXiv Detail & Related papers (2025-05-19T17:24:16Z)
- BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs).
The LLM itself decides, via self-evaluation, whether it needs to employ more sophisticated cognitive skills.
In extensive experiments across 4 popular math reasoning datasets, we demonstrate the effectiveness of our proposed approach.
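A sketch of the escalation loop as the abstract describes it: try successively higher Bloom's-taxonomy skills until the model's self-evaluation accepts the answer. The prompt wording and stop condition are illustrative assumptions.

```python
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def call_llm(prompt: str) -> str:
    """Hypothetical stub for any chat-model client."""
    return "ANSWER: 12\nSELF-CHECK: CORRECT"

def bloomwise(problem: str) -> str:
    out = ""
    for level in BLOOM_LEVELS:
        out = call_llm(
            f"Solve using the '{level}' cognitive skill, then self-evaluate "
            f"(CORRECT/INCORRECT):\n{problem}"
        )
        if "SELF-CHECK: CORRECT" in out:
            return out  # stop escalating once the model trusts its answer
    return out  # fall back to the most sophisticated attempt
```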
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
- Benchmarking Language Model Creativity: A Case Study on Code Generation [39.546827184857754]
In this work, we introduce a framework for quantifying LLM creativity. We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the generated creative responses. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions.
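A toy illustration of scoring both components; the actual NEOGAUGE metric is defined in the paper, and this crude product of correctness and token-level novelty against prior human solutions is only a stand-in.

```python
def novelty(solution: str, human_solutions: list[str]) -> float:
    # Fraction of solution tokens unseen in prior human solutions
    # (a crude proxy for divergent thinking).
    seen = {tok for s in human_solutions for tok in s.split()}
    toks = solution.split()
    return sum(t not in seen for t in toks) / max(len(toks), 1)

def creativity_score(passed_tests: bool, solution: str, humans: list[str]) -> float:
    convergent = 1.0 if passed_tests else 0.0  # does it solve the problem at all?
    divergent = novelty(solution, humans)      # does it depart from known approaches?
    return convergent * divergent
```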
arXiv Detail & Related papers (2024-07-12T05:55:22Z)
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples [12.48027669682156]
Flow of Reasoning (FoR) aims to improve reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. Experiments show that, with limited training examples, FoR enables the discovery of diverse, creative, high-quality solutions.
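For intuition, a minimal form of the GFlowNet-style trajectory-balance objective that flow-based formulations like this build on (notation simplified; not FoR's code).

```python
import math

def trajectory_balance_loss(log_z: float, log_pf: list[float],
                            log_pb: list[float], reward: float) -> float:
    # One reasoning trajectory: (log Z + sum log P_F - log R - sum log P_B)^2.
    # reward must be positive for log R to be defined.
    return (log_z + sum(log_pf) - math.log(reward) - sum(log_pb)) ** 2
```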
arXiv Detail & Related papers (2024-06-09T07:06:58Z)
- Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs [2.3020018305241337]
Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models.
We propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions.
Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder.
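A sketch of the Reasoner-to-Coder hand-off as the abstract describes it: the Reasoner explains the solution strategy in prose, and the Coder implements it. `call_llm` is a hypothetical stub.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub for any chat-model client."""
    return "(model response)"

def solve_with_explanation(problem: str) -> str:
    plan = call_llm(f"Explain, in prose, how to solve:\n{problem}")  # Reasoner
    return call_llm(f"Implement this plan as a program:\n{plan}")    # Coder
```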
arXiv Detail & Related papers (2024-04-11T22:19:50Z)
- MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems. We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z)
- SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving [64.38649623473626]
Large Language Models (LLMs) have driven substantial progress in artificial intelligence.
We propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems.
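An illustrative subgoal loop in this spirit; SEGO's actual optimization procedure is more involved, and this only shows the overall shape.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub for any chat-model client."""
    return "(model response)"

def solve_via_subgoals(problem: str, max_steps: int = 5) -> str:
    state = problem
    for _ in range(max_steps):
        # Propose the next intermediate subgoal, then advance toward it.
        subgoal = call_llm(f"Propose the next subgoal for:\n{state}")
        state = call_llm(
            f"Advance the solution by achieving:\n{subgoal}\nCurrent state:\n{state}"
        )
    return state
```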
arXiv Detail & Related papers (2023-10-19T17:56:40Z)
- Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs).
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
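Re2 is simple enough to show directly: the input question is presented twice before the answer is elicited. The template below paraphrases the paper's method; the example question is a placeholder.

```python
def re2_prompt(question: str) -> str:
    # Present the question, re-read it, then elicit the answer.
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

print(re2_prompt("A bat and a ball cost $1.10 in total..."))
```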
arXiv Detail & Related papers (2023-09-12T14:36:23Z)
- Conceptual Design Generation Using Large Language Models [0.34998703934432673]
Large Language Models (LLMs) can generate seemingly creative outputs from textual prompts.
In this paper, we leverage LLMs to generate solutions for a set of 12 design problems and compare them to a baseline of crowdsourced solutions.
Expert evaluations indicate that the LLM-generated solutions have higher average feasibility and usefulness than the crowdsourced baseline.
We experiment with prompt engineering and find that leveraging few-shot learning can lead to the generation of solutions that are more similar to the crowdsourced solutions.
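A sketch of the few-shot setup the authors report helps: prepend a handful of (problem, solution) exemplars before the new design problem. The exemplars here are placeholders, not the paper's data.

```python
def few_shot_prompt(examples: list[tuple[str, str]], problem: str) -> str:
    # Concatenate exemplars, then pose the new problem in the same format.
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in examples)
    return f"{shots}\n\nProblem: {problem}\nSolution:"

demo = few_shot_prompt(
    [("Design a device to water plants while away.", "A gravity-fed drip...")],
    "Design a tool to help peel garlic quickly.",
)
print(demo)
```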
arXiv Detail & Related papers (2023-05-30T19:32:39Z)