True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3
and Challenging for GPT-4
- URL: http://arxiv.org/abs/2212.10114v2
- Date: Thu, 1 Jun 2023 18:50:21 GMT
- Title: True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3
and Challenging for GPT-4
- Authors: Maksym Del and Mark Fishel
- Abstract summary: Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities.
In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles.
We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated solid zero-shot reasoning
capabilities, which is reflected in their performance on the current test
tasks. This calls for a more challenging benchmark that requires highly advanced
reasoning ability to solve. In this paper, we introduce such a benchmark,
consisting of 191 long-form (1200 words on average) mystery narratives
constructed as detective puzzles. Puzzles are sourced from the "5 Minute
Mystery" platform and include a multiple-choice question for evaluation. Only
47% of humans solve a puzzle successfully on average, while the best human
solvers achieve over 80% success rate. We show that GPT-3 models barely
outperform random on this benchmark (with 28% accuracy) while state-of-the-art
GPT-4 solves only 38% of puzzles. This indicates that there is still a
significant gap in the deep reasoning abilities of LLMs and humans and
highlights the need for further research in this area. Our work introduces a
challenging benchmark for future studies on reasoning in language models and
contributes to a better understanding of the limits of LLMs' abilities.
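As a concrete illustration of the evaluation protocol described above (a long-form narrative followed by a multiple-choice question, answered zero-shot and scored by accuracy), the following minimal Python sketch shows how such a benchmark could be scored. It is an assumption-laden sketch rather than the authors' released code: the Puzzle structure, the prompt wording, and the ask_model callable are hypothetical stand-ins for the actual data format and LLM interface.

```python
from dataclasses import dataclass
from typing import Callable, List

LETTERS = "ABCDEFGH"

@dataclass
class Puzzle:
    narrative: str       # long-form mystery text (~1200 words on average in the benchmark)
    options: List[str]   # multiple-choice answer candidates
    gold_index: int      # index of the correct answer

def build_prompt(puzzle: Puzzle) -> str:
    """Zero-shot prompt: the full narrative followed by lettered answer options."""
    option_block = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(puzzle.options))
    return (
        f"{puzzle.narrative}\n\n"
        f"Question: who is the culprit? Answer with a single letter.\n"
        f"{option_block}\nAnswer:"
    )

def evaluate(puzzles: List[Puzzle], ask_model: Callable[[str], str]) -> float:
    """Accuracy of an LLM (wrapped in the hypothetical ask_model callable) against gold answers."""
    correct = 0
    for puzzle in puzzles:
        reply = ask_model(build_prompt(puzzle)).strip().upper()
        predicted = LETTERS.find(reply[0]) if reply else -1  # naive single-letter parsing
        correct += int(predicted == puzzle.gold_index)
    return correct / len(puzzles)

def random_baseline(puzzles: List[Puzzle]) -> float:
    """Expected accuracy of uniform random guessing over each puzzle's options."""
    return sum(1.0 / len(p.options) for p in puzzles) / len(puzzles)
```

Under a protocol of this kind, the reported 28% for GPT-3 models sits barely above the random-guessing baseline, while GPT-4's 38% remains well below the 47% average human solve rate.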
Related papers
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in two canonical game-theoretic, two-player non-zero-sum games: Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that each of these models is affected by at least one systematic bias when making decisions in these games.
arXiv Detail & Related papers (2024-07-05T12:30:02Z) - LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
LingOly is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z) - Are LLMs Good Cryptic Crossword Solvers? [4.463184061618504]
Cryptic crosswords are puzzles that rely on the solver's ability to manipulate language on different levels and deal with various types of wordplay.
Previous research suggests that solving such puzzles is a challenge even for modern NLP models.
arXiv Detail & Related papers (2024-03-15T06:57:08Z) - REBUS: A Robust Evaluation Benchmark of Understanding Symbols [1.90463290938268]
GPT-4o significantly outperforms all other models, with the remaining proprietary models in turn outperforming all other evaluated models.
Even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles.
Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
arXiv Detail & Related papers (2024-01-11T00:30:28Z) - Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem-solving abilities of large language models (LLMs).
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%.
arXiv Detail & Related papers (2023-05-24T11:55:59Z) - Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z) - Evaluating Large Language Models in Theory of Mind Tasks [11.622327857276389]
Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks.
The battery included 640 prompts spread across 40 diverse tasks, each comprising a false-belief scenario, closely matched true-belief controls, and the reversed versions of all of them, for eight scenarios per task.
To solve a single task, a model needed to correctly answer all 16 prompts across these eight scenarios.
arXiv Detail & Related papers (2023-02-04T03:50:01Z) - Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the code-davinci-002 language model and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z) - PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)