Related papers: Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

URL: http://arxiv.org/abs/2602.19914v1
Date: Mon, 23 Feb 2026 14:54:38 GMT
Title: Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning
Authors: Thatchawin Leelawat, Lewis D Griffin
Abstract summary: Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. Results show a clear improvement in AI model performance over time.
Score: 1.094320514634939
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system was developed and validated against human assessors to enable scalable and replicable performance evaluation. Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with reasoning-oriented model architectures. Systematic differences in the performance of AI models compared to humans, dependent on features of the specific detection puzzle, were mostly absent with the exception of a fall in performance for models when solving longer cases (case lengths being in the range of 1900-4000 words), and an advantage at inductive reasoning for reasoning models at early stages of case solving when evidence was scant.

Related papers

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z)
The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility [0.0]
Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
arXiv Detail & Related papers (2025-11-23T05:49:57Z)
The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation [1.2324085268373774]
We ask whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families (OpenAI, Anthropic, and Google) and how their reasoning capabilities evolve over the years.
arXiv Detail & Related papers (2025-11-03T09:09:29Z)
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1 [0.0]
We show that better performance is not always caused by test-time algorithmic improvements or model sizes but also by using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam.
arXiv Detail & Related papers (2025-08-13T20:15:20Z)
Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability [16.441081996257576]
We propose leveraging reasoning-intensive models to improve less computationally demanding, non-reasoning models. We demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
arXiv Detail & Related papers (2025-04-13T16:26:56Z)
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks [47.40240774236047]
We compare four Chat Llama 2 models against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences.
arXiv Detail & Related papers (2025-02-24T01:01:02Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models [44.42887452269389]
Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases. We introduce DetectBench, a reading comprehension dataset designed to assess a model's joint ability in key information detection and multi-hop reasoning. To enhance models' detective skills, we propose the Detective Thinking Framework. These methods encourage models to identify all possible clues within the context before reasoning.
arXiv Detail & Related papers (2023-07-11T08:45:46Z)
To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words. We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.