EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
- URL: http://arxiv.org/abs/2502.08859v2
- Date: Fri, 14 Feb 2025 16:40:15 GMT
- Title: EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
- Authors: Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, Dan Hendrycks
- Abstract summary: We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events.
This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning.
The benchmark comprises 1184 puzzles of varying complexity, each typically requiring teams of skilled solvers hours to days to complete.
- Score: 17.056693711040747
- Abstract: As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity -- each typically requiring teams of skilled solvers hours to days to complete -- with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity's Last Exam, unveiling models' shortcomings when challenged with problems requiring unstructured and lateral reasoning.
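Because every puzzle has a single unambiguous answer, evaluation can reduce to normalized exact-match grading. Below is a minimal sketch of such a grader; the normalization rules, function names, and sample data are illustrative assumptions, not the paper's actual harness.

```python
import re

def normalize(answer: str) -> str:
    """Canonicalize an answer string: uppercase, strip non-alphanumerics.

    Puzzle-hunt answers are conventionally short words or phrases, so
    this normalization is an assumption that mirrors that style.
    """
    return re.sub(r"[^A-Z0-9]", "", answer.upper())

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of puzzles whose predicted answer exactly matches gold."""
    correct = sum(
        normalize(predictions.get(pid, "")) == normalize(ans)
        for pid, ans in gold.items()
    )
    return correct / len(gold)

# Hypothetical usage with two puzzles:
gold = {"puzzle-001": "ENIGMA", "puzzle-002": "LATERAL THINKING"}
preds = {"puzzle-001": "enigma", "puzzle-002": "lateral-thinker"}
print(f"accuracy = {score(preds, gold):.2%}")  # accuracy = 50.00%
```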
Related papers
- Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns? [57.80779199039929]
Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems.
This paper introduces a novel benchmark, BeyondX, designed to address these limitations by incorporating problems with multiple unknowns.
Empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, significantly decreases as the number of unknowns increases.
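To make the more-than-two-unknowns setting concrete, here is a hypothetical three-unknown problem of the kind such a benchmark targets, checked symbolically with sympy; the specific problem is invented for illustration and is not drawn from BeyondX.

```python
from sympy import symbols, Eq, solve

# Hypothetical problem with three unknowns a, b, c:
# a + b + c = 20; a is twice b; c is 4 more than b.
a, b, c = symbols("a b c")
solution = solve(
    [Eq(a + b + c, 20), Eq(a, 2 * b), Eq(c, b + 4)],
    [a, b, c],
)
print(solution)  # {a: 8, b: 4, c: 8}
```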
arXiv Detail & Related papers (2024-07-06T17:01:04Z)
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
- VC Search: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning [46.25056744404318]
We develop a benchmark called Problems with Missing and Contradictory Conditions (PMC) containing over 5,000 validated ill-defined mathematical problems.
The proposed VCSEARCH method improves the accuracy of identifying unsolvable problems by at least 12% across different large language models.
arXiv Detail & Related papers (2024-06-07T16:24:12Z)
- Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning [24.386388107656334]
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering.
We present a new dataset, AlgoVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles.
arXiv Detail & Related papers (2024-03-06T17:15:04Z)
- Puzzle Solving using Reasoning of Large Language Models: A Survey [1.9939549451457024]
This survey examines the capabilities of Large Language Models (LLMs) in puzzle solving.
Our findings highlight the disparity between LLM capabilities and human-like reasoning.
The survey underscores the necessity for novel strategies and richer datasets to advance LLMs' puzzle-solving proficiency.
arXiv Detail & Related papers (2024-02-17T14:19:38Z)
- Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models [43.09706839884221]
Boosting of Thoughts (BoT) is an automated prompting framework for problem solving with Large Language Models.
We show that BoT consistently achieves higher or comparable problem-solving rates than other advanced prompting approaches.
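The trial-and-error idea can be sketched as a loop that re-prompts a model with feedback from its failed attempts. The sketch below is a generic reconstruction under stated assumptions: the `ask_model` call, feedback format, and verifier are placeholders, not the BoT authors' implementation.

```python
from typing import Callable

def trial_and_error_solve(
    question: str,
    check: Callable[[str], bool],     # verifier for candidate answers
    ask_model: Callable[[str], str],  # placeholder LLM call (assumed)
    max_rounds: int = 5,
) -> str | None:
    """Iteratively re-prompt, feeding failed attempts back as hints."""
    failures: list[str] = []
    for _ in range(max_rounds):
        prompt = question
        if failures:
            prompt += "\nPrevious incorrect attempts:\n" + "\n".join(failures)
            prompt += "\nAnalyze why these failed, then try again."
        attempt = ask_model(prompt)
        if check(attempt):
            return attempt        # verified solution found
        failures.append(attempt)  # accumulate error feedback
    return None                   # give up after max_rounds
```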
arXiv Detail & Related papers (2024-02-17T00:13:36Z)
- Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, informed by expert knowledge, for efficiently solving logical puzzles in video games.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
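As a rough illustration of a genetic approach to a constraint-satisfaction puzzle, the sketch below evolves candidate assignments whose fitness is the number of satisfied constraints, using the classic 8-queens puzzle. The puzzle, operators, and parameters are illustrative assumptions, not the paper's algorithm.

```python
import random

N = 8                        # hypothetical puzzle: 8 queens, one per column
POP, GENS = 200, 500         # illustrative population and generation budget
MAX_FIT = N * (N - 1) // 2   # all 28 queen pairs non-attacking = solved

def fitness(board: list[int]) -> int:
    """Number of non-attacking queen pairs (board[i] = row of column i)."""
    return sum(
        board[i] != board[j] and abs(board[i] - board[j]) != j - i
        for i in range(N) for j in range(i + 1, N)
    )

def crossover(p1: list[int], p2: list[int]) -> list[int]:
    cut = random.randrange(1, N)  # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(board: list[int], rate: float = 0.1) -> list[int]:
    return [random.randrange(N) if random.random() < rate else g for g in board]

population = [[random.randrange(N) for _ in range(N)] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == MAX_FIT:
        break                           # solved
    elite = population[: POP // 2]      # keep the fitter half as parents
    population = elite[:10] + [         # elitism plus recombined offspring
        mutate(crossover(*random.sample(elite, 2))) for _ in range(POP - 10)
    ]
best = max(population, key=fitness)
print(f"best board {best} with fitness {fitness(best)}/{MAX_FIT}")
```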
arXiv Detail & Related papers (2023-02-17T18:15:33Z)
- Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6-8 age group.
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
Experiments reveal that while powerful deep models offer reasonable performance on these puzzles in a supervised setting, they fare no better than random chance when evaluated for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability [30.01308882849197]
We propose a new methodology for creating challenging algorithmic reasoning datasets.
The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language.
We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems.
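The "hard SAT sampling" insight refers to the well-known phase transition in random k-SAT: for 3-SAT, instances with a clause-to-variable ratio near 4.26 are empirically the hardest. Below is a minimal sketch for sampling such instances; the generator is a standard construction, not the paper's exact pipeline, and any downstream rendering into natural language is omitted.

```python
import random

def random_3sat(num_vars: int, ratio: float = 4.26) -> list[tuple[int, ...]]:
    """Sample a random 3-SAT instance near the satisfiability threshold.

    Clauses are tuples of nonzero ints; a negative value means a negated
    literal. A ratio of ~4.26 clauses per variable marks the empirically
    hardest region for random 3-SAT.
    """
    num_clauses = round(ratio * num_vars)
    clauses = []
    for _ in range(num_clauses):
        vars_ = random.sample(range(1, num_vars + 1), 3)  # 3 distinct variables
        clauses.append(tuple(v if random.random() < 0.5 else -v for v in vars_))
    return clauses

# Example: a small instance with 10 variables and ~43 clauses.
instance = random_3sat(10)
print(len(instance), instance[:3])
```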
arXiv Detail & Related papers (2021-12-16T17:47:20Z)
- PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.