PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
- URL: http://arxiv.org/abs/2502.01584v3
- Date: Mon, 31 Mar 2025 14:21:49 GMT
- Title: PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
- Authors: Zixuan Wu, Francesca Lucchetti, Aleksander Boruch-Gruszecki, Jingmiao Zhao, Carolyn Jane Anderson, Joydeep Biswas, Federico Cassano, Molly Q Feldman, Arjun Guha,
- Abstract summary: Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp.<n>We present a benchmark with 594 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge.<n>Our benchmark is challenging for both humans and models; however correct solutions are easy to verify, and models' mistakes are easy to spot.
- Score: 41.85078638790154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 594 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
Related papers
- THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists.
We find that in general, reasoning models are poorly calibrated, particularly on easy problems.
We introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z) - DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs [3.850766603072179]
We introduce Don't Reason Bench (DNR Bench) to evaluate large language models (LLMs)
DNR Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to.
Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle efficiently with higher accuracy.
arXiv Detail & Related papers (2025-03-20T02:19:14Z) - MastermindEval: A Simple But Scalable Reasoning Benchmark [3.5519847710183674]
We introduce MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind.
Our benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer.
arXiv Detail & Related papers (2025-03-07T19:24:59Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs? [32.972545797220924]
Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks.<n> GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities.
arXiv Detail & Related papers (2024-12-13T11:38:10Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning)
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z) - Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For
Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem solving abilities of large language models (LLMs)
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.