RiddleBench: A New Generative Reasoning Benchmark for LLMs
- URL: http://arxiv.org/abs/2510.24932v1
- Date: Tue, 28 Oct 2025 19:58:24 GMT
- Title: RiddleBench: A New Generative Reasoning Benchmark for LLMs
- Authors: Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre
- Abstract summary: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. RiddleBench is a benchmark of 1,737 challenging puzzles in English designed to probe core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses.
- Score: 23.638413274414276
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.
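The constraint-reordering fragility described in the abstract lends itself to a simple evaluation harness. The sketch below shows one plausible way to measure it; `puzzles`, `query_model`, and the record fields are illustrative assumptions, not RiddleBench's actual data format or interface.

```python
import random

# Hedged sketch of the fragility probe the abstract describes: score a
# model on each puzzle as written, then again with its constraints
# reordered. `puzzles` and `query_model` are assumed placeholders,
# not part of the actual RiddleBench release.

def build_prompt(preamble: str, constraints: list[str], question: str) -> str:
    """Assemble a puzzle prompt from its parts."""
    return "\n".join([preamble, *constraints, question])

def accuracy(puzzles, query_model, shuffle_constraints: bool = False) -> float:
    """Exact-match accuracy, optionally with constraint order randomized."""
    correct = 0
    for p in puzzles:
        constraints = list(p["constraints"])
        if shuffle_constraints:
            random.shuffle(constraints)  # reorder only; content is unchanged
        prompt = build_prompt(p["preamble"], constraints, p["question"])
        if query_model(prompt).strip().lower() == p["answer"].lower():
            correct += 1
    return correct / len(puzzles)

# A robust reasoner should score roughly the same in both conditions;
# the abstract reports a significant drop under reordering.
# baseline = accuracy(puzzles, query_model)
# reordered = accuracy(puzzles, query_model, shuffle_constraints=True)
```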
Related papers
- MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning [61.04601861108966]
We propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models. MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model's reasoning process. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5.
arXiv Detail & Related papers (2025-10-16T03:30:56Z)
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models [49.92148175114169]
We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions. Models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely.
arXiv Detail & Related papers (2025-10-15T14:51:36Z)
- Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know? [7.423494663010787]
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks. Like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect. Knowing when and how much to trust these models is critical to their safe deployment in real-world applications.
arXiv Detail & Related papers (2025-06-22T21:46:42Z)
- Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate whether reasoning models can defend their beliefs under gaslighting negation prompts.
arXiv Detail & Related papers (2025-06-11T12:52:25Z)
- The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall, and that problem difficulty affects efficiency. We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while preserving accuracy across four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z)
- Thought calibration: Efficient and confident test-time scaling [11.028893528095196]
Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. We propose thought calibration to decide dynamically when thinking can be terminated. We realize this framework through lightweight probes that operate on top of the language model's hidden representations.
arXiv Detail & Related papers (2025-05-23T22:17:18Z)
- Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens [51.90059610606049]
This paper revisits the efficiency of such reasoning processes through an information-theoretic lens. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high (a generic sketch of such an entropy gate appears after this list).
arXiv Detail & Related papers (2025-05-23T13:38:56Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.931194824519935]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises [41.39610589639382]
We present RuozhiBench, a dataset of 677 carefully curated questions featuring various forms of deceptive reasoning. We evaluate 17 large language models (LLMs) from five series over RuozhiBench using both open-ended and two-choice formats. LLMs showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.
arXiv Detail & Related papers (2025-02-18T18:47:11Z)
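The entropy-based Adaptive Think idea summarized above (halting once the model is confident enough) can be illustrated with a small, generic sketch. Everything here, including the `step_fn` interface, window size, and threshold, is an assumption for illustration, not the paper's actual method.

```python
import math

# Hypothetical entropy-gated early stopping for a step-by-step reasoning
# loop. The stopping rule, window, and threshold are illustrative choices.

def token_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy (nats) of a next-token distribution given log-probs."""
    return -sum(math.exp(lp) * lp for lp in logprobs.values())

def generate_with_entropy_gate(step_fn, max_steps: int = 512,
                               window: int = 16, threshold: float = 0.5) -> str:
    """step_fn() is an assumed API yielding (token, logprobs) per step.
    Halt once mean entropy over the last `window` steps falls below
    `threshold`, i.e. the model has become confident."""
    tokens, entropies = [], []
    for _ in range(max_steps):
        token, logprobs = step_fn()
        tokens.append(token)
        entropies.append(token_entropy(logprobs))
        recent = entropies[-window:]
        if len(recent) == window and sum(recent) / window < threshold:
            break  # confidence is high enough; stop thinking early
    return "".join(tokens)
```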
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.