Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring
- URL: http://arxiv.org/abs/2502.07087v1
- Date: Mon, 10 Feb 2025 22:27:02 GMT
- Title: Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring
- Authors: Alex Heyman, Joel Zylberberg,
- Abstract summary: We investigate graph coloring as a means of evaluating an LLM's capacities for systematic step-by-step reasoning and possibility space exploration.
We test Claude 3.5 Sonnet, Llama 3.1 405B, Gemini 1.5 Pro, GPT-4o, o1-mini, and DeepSeek-R1 on a dataset of $k$-coloring problems.
- Score: 0.0
- License:
- Abstract: Contemporary large language models are powerful problem-solving tools, but they exhibit weaknesses in their reasoning abilities which ongoing research seeks to mitigate. We investigate graph coloring as a means of evaluating an LLM's capacities for systematic step-by-step reasoning and possibility space exploration, as well as effects of semantic problem framing. We test Claude 3.5 Sonnet, Llama 3.1 405B, Gemini 1.5 Pro, GPT-4o, o1-mini, and DeepSeek-R1 on a dataset of $k$-coloring problems with $2 \leq k \leq 4$ and vertex count $4 \leq n \leq 8$, using partial algorithmic solvers to further categorize problems by difficulty. In addition to substantial but varying framing effects, we find that all models except o1-mini and R1 exhibit $>60\%$ error rates on difficult problem types in all frames ($>15\%$ for o1-mini and $>10\%$ for R1), and no model achieves perfect accuracy even in the simple domain of 2-coloring 4-vertex graphs. Our results highlight both the considerable recent progress in LLM systematic reasoning and the limits of its reliability, especially in relation to increasing computational costs. We expect that more complex graph coloring problems, and procedural generation of arbitrary-complexity reasoning problems more broadly, offer further untapped potential for LLM benchmarking.
Related papers
- Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models [2.1894663332872932]
We investigate the capabilities of large multimodal models (LMMs) to solve graph and tree data structure problems based only on images.
GPT-4o and Gemini 1.5 Flash performed best on trees and graphs respectively.
Our findings highlight the influence of structural and visual variations on model performance.
arXiv Detail & Related papers (2024-12-15T07:15:19Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - GCoder: Improving Large Language Model for Generalized Graph Problem Solving [38.9131866084555]
Large Language Models (LLMs) have demonstrated strong reasoning abilities, making them suitable for complex tasks such as graph computation.
We introduce GCoder, a code-based LLM designed to enhance problem-solving in generalized graph problems.
Our method involves constructing an extensive training dataset, GraphWild, featuring diverse graph formats and algorithms.
arXiv Detail & Related papers (2024-10-24T18:40:36Z) - Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - GraphArena: Evaluating and Exploring Large Language Models on Graph Computation [38.65000765032749]
GraphArena is a tool designed to evaluate Large Language Models (LLMs) on real-world graph problems.
Evaluation of over 10 LLMs reveals that even top-performing LLMs struggle with larger, more complex graph problems.
We explore four potential solutions to address this issue, including chain-of-thought prompting, instruction tuning, code writing, and scaling test-time compute.
arXiv Detail & Related papers (2024-06-29T09:19:23Z) - Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z) - Evaluating Large Language Models on Graphs: Performance Insights and
Comparative Analysis [7.099257763803159]
We evaluate the capabilities of four Large Language Models (LLMs) in addressing several analytical problems with graph data.
We employ four distinct evaluation metrics: Correctness, Fidelity, and Rectification.
GPT models can generate logical and coherent results, outperforming alternatives in correctness.
arXiv Detail & Related papers (2023-08-22T06:32:07Z) - Can Language Models Solve Graph Problems in Natural Language? [51.28850846990929]
Large language models (LLMs) are increasingly adopted for a variety of tasks with implicit graphical structures.
We propose NLGraph, a benchmark of graph-based problem solving simulating in natural language.
arXiv Detail & Related papers (2023-05-17T08:29:21Z) - Simultaneously Learning Stochastic and Adversarial Bandits with General
Graph Feedback [15.429356827868514]
We introduce a new trade-off mechanism for exploration and exploitation of general feedback graphs.
We prove the proposed algorithm simultaneously achieves $mathrmpoly-designed log T$ regret in the adversarial setting.
This is the first best-of-both-worlds result for general feedback graphs.
arXiv Detail & Related papers (2022-06-16T04:21:27Z) - The Performance of the MLE in the Bradley-Terry-Luce Model in
$\ell_{\infty}$-Loss and under General Graph Topologies [76.61051540383494]
We derive novel, general upper bounds on the $ell_infty$ estimation error of the Bradley-Terry-Luce model.
We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results.
arXiv Detail & Related papers (2021-10-20T23:46:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.