Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious
Challenges in Multimodal Reasoning
- URL: http://arxiv.org/abs/2403.03864v3
- Date: Wed, 13 Mar 2024 00:50:05 GMT
- Title: Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious
Challenges in Multimodal Reasoning
- Authors: Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, Soujanya Poria
- Abstract summary: This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering.
We present a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles.
- Score: 24.386388107656334
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces the novel task of multimodal puzzle solving, framed
within the context of visual question-answering. We present a new dataset,
AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal
language models in solving algorithmic puzzles that necessitate visual
understanding, language understanding, and complex algorithmic reasoning. We
create the puzzles to encompass a diverse array of mathematical and algorithmic
topics such as boolean logic, combinatorics, graph theory, optimization,
search, etc., aiming to evaluate the gap between visual data interpretation and
algorithmic problem-solving skills. The dataset is generated automatically from
code authored by humans. All our puzzles have exact solutions that can be found
from the algorithm without tedious human calculations. This ensures that our
dataset can be scaled up arbitrarily in terms of reasoning complexity and
dataset size. Our investigation reveals that large language models (LLMs) such
as GPT-4V and Gemini exhibit limited performance in puzzle-solving tasks. We
find that their performance is near random in a multi-choice question-answering
setup for a significant number of puzzles. The findings emphasize the
challenges of integrating visual, language, and algorithmic knowledge for
solving complex reasoning problems.
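To make the generation process concrete, the following is a minimal, hypothetical sketch (not the authors' AlgoPuzzleVQA code) of a puzzle whose exact answer is computed algorithmically rather than by hand: a small random weighted graph is drawn, the question asks for a shortest-path length, and the ground truth comes from a standard search.

    # Illustrative sketch only: a toy generator in the spirit of puzzles that are
    # "generated automatically from code authored by humans" with exact answers.
    # The puzzle type, function names, and option format are hypothetical.
    import random
    from itertools import combinations

    def generate_shortest_path_puzzle(n_nodes=6, seed=0):
        rng = random.Random(seed)
        nodes = list(range(n_nodes))
        edges = {}
        # Random undirected weighted edges, plus a chain to guarantee connectivity.
        for u, v in combinations(nodes, 2):
            if rng.random() < 0.5:
                edges[(u, v)] = rng.randint(1, 9)
        for u in range(n_nodes - 1):
            edges.setdefault((u, u + 1), rng.randint(1, 9))

        # Exact answer via Dijkstra-style relaxation: no human calculation needed.
        dist = {v: float("inf") for v in nodes}
        dist[0] = 0
        unvisited = set(nodes)
        while unvisited:
            u = min(unvisited, key=dist.get)
            unvisited.remove(u)
            for (a, b), w in edges.items():
                if a == u and dist[u] + w < dist[b]:
                    dist[b] = dist[u] + w
                elif b == u and dist[u] + w < dist[a]:
                    dist[a] = dist[u] + w
        answer = dist[n_nodes - 1]

        # Wrap into a multiple-choice item, mirroring a VQA-style evaluation setup.
        options = sorted({answer, answer + 1, answer + 2, max(1, answer - 1)})
        question = (f"In the pictured graph, what is the length of the shortest "
                    f"path from node 0 to node {n_nodes - 1}?")
        return {"question": question, "options": options, "answer": answer}

    print(generate_shortest_path_puzzle())

Because the answer comes from the generating code itself, instance difficulty and dataset size can be scaled arbitrarily, which is the property the abstract emphasizes.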
Related papers
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events.
This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning.
The benchmark comprises 1184 puzzles of varying complexity that require teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z)
- Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization [56.17811386955609]
Graph-structured challenges are inherently difficult due to their nonlinear and intricate nature.
In this study, we propose transforming graphs into images to preserve their higher-order structural features accurately.
By combining the innovative paradigm powered by multimodal large language models with simple search techniques, we aim to develop a novel and effective framework.
arXiv Detail & Related papers (2025-01-21T08:28:10Z)
- Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems [25.0042181817455]
We introduce a multi-agent system, ZPS, that integrates Large Language Models with an off-the-shelf theorem prover.
This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts.
We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. (A toy constraint-checking sketch for this style of puzzle appears after this list.)
arXiv Detail & Related papers (2024-07-04T14:22:25Z)
- Solving Witness-type Triangle Puzzles Faster with an Automatically Learned Human-Explainable Predicate [0.29005223064604074]
We develop a search-based artificial intelligence puzzle solver for The Witness game.
We learn a human-explainable predicate that predicts whether a partial path in a Witness-type puzzle cannot be completed to a solution path.
We prove a key property of the learned predicate which allows us to use it for pruning successor states in search.
arXiv Detail & Related papers (2023-08-04T18:52:18Z)
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks [59.26515696183751]
We show that algorithm discovery in neural networks is sometimes more complex.
We show that even simple learning problems can admit a surprising diversity of solutions.
arXiv Detail & Related papers (2023-06-30T17:59:13Z)
- Solving and Generating NPR Sunday Puzzles with Large Language Models [0.0]
State-of-the-art large language models can solve many PUZZLEQA puzzles.
The best model, GPT-3.5, achieves 50.2% loose accuracy.
GPT-3.5 generates puzzles with answers that do not conform to the generated rules.
arXiv Detail & Related papers (2023-06-21T13:23:48Z)
- Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, informed by expert knowledge, for solving logical puzzles in video games efficiently.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
arXiv Detail & Related papers (2023-02-17T18:15:33Z)
- Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6--8 age group.
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
Experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
- PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)
- Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)
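As a companion to the zebra-puzzle entry above, here is a toy constraint-checking sketch. It is not the ZPS system (which couples LLM agents with an off-the-shelf theorem prover); it only illustrates how grid-puzzle clues can be written as predicates and verified mechanically by exhaustive search over assignments, with all entities and clues invented for illustration.

    # Illustrative sketch only, not the ZPS multi-agent system: a tiny zebra-style
    # puzzle solved by brute-force constraint checking over all assignments.
    from itertools import permutations

    HOUSES = (1, 2, 3)                      # positions, left to right
    PEOPLE = ("Alice", "Bob", "Carol")      # hypothetical puzzle entities
    DRINKS = ("tea", "coffee", "milk")

    def solutions():
        for person_pos in permutations(HOUSES):
            P = dict(zip(PEOPLE, person_pos))
            for drink_pos in permutations(HOUSES):
                D = dict(zip(DRINKS, drink_pos))
                clues = (
                    P["Alice"] == 1,                # Alice lives in the first house.
                    D["milk"] == 2,                 # Milk is drunk in the middle house.
                    abs(P["Bob"] - D["tea"]) == 1,  # Bob lives next to the tea drinker.
                    P["Carol"] == D["coffee"],      # Carol drinks coffee.
                )
                if all(clues):
                    yield P, D

    for assignment in solutions():
        print(assignment)  # ({'Alice': 1, 'Bob': 2, 'Carol': 3}, {'tea': 1, 'coffee': 3, 'milk': 2})

In the paper's setting, clauses like these would be produced by LLM agents and discharged by a theorem prover; the brute-force loop here merely stands in for that checking step.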
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.