Related papers: Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making

Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making

URL: http://arxiv.org/abs/2506.12012v1
Date: Fri, 13 Jun 2025 17:59:10 GMT
Title: Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making
Authors: Xiaopeng Yuan, Xingjian Zhang, Ke Xu, Yifan Xu, Lijun Yu, Jindong Wang, Yushun Dong, Haohan Wang,
Abstract summary: Large language models (LLMs) are increasingly used for tasks that require complex reasoning.<n>We argue that measuring internal processes is essential for understanding model behavior and improving reliability.<n>We introduce a framework that evaluates LLMs along three core dimensions: planning, revision, and resource-constrained decision making.
Score: 38.75183725659772
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used for tasks that require complex reasoning. Most benchmarks focus on final outcomes but overlook the intermediate reasoning steps - such as planning, revision, and decision making under resource constraints. We argue that measuring these internal processes is essential for understanding model behavior and improving reliability. We propose using strategic games as a natural evaluation environment: closed, rule-based systems with clear states, limited resources, and automatic feedback. We introduce a framework that evaluates LLMs along three core dimensions: planning, revision, and resource-constrained decision making. To operationalize this, we define metrics beyond win rate, including overcorrection risk rate, correction success rate, improvement slope, and over-budget ratio. In 4320 adversarial rounds across 12 leading models, ChatGPT-o3-mini achieves the top composite score, with a win rate of 74.7 percent, a correction success rate of 78.6 percent, and an improvement slope of 0.041. By contrast, Qwen-Plus, despite an overcorrection risk rate of 81.6 percent, wins only 25.6 percent of its matches - primarily due to excessive resource use. We also observe a negative correlation between overcorrection risk rate and correction success rate (Pearson r = -0.51, p = 0.093), suggesting that more frequent edits do not always improve outcomes. Our findings highlight the value of assessing not only what LLMs decide but how they arrive at those decisions

Related papers

Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research [19.31559944205485]
Operations Research practitioners routinely debug infeasible models through an iterative process.<n>We introduce two benchmarks that place the textbfsolver in the evaluation loop<n>We find that domain-specific RLVR training enables an 8B model to surpass frontier APIs.
arXiv Detail & Related papers (2026-01-28T20:02:44Z)
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems [0.29465623430708904]
Uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap.<n>We introduce Causal Judge Evaluation, a framework that fixes all three failures.
arXiv Detail & Related papers (2025-12-11T22:16:24Z)
Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering [52.69447404069251]
Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning.<n>We propose Prune-Then-Plan, a framework that stabilizes exploration through step-level calibration.
arXiv Detail & Related papers (2025-11-24T22:50:50Z)
Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People [81.63702981397408]
Given limited resources, to what extent do agents based on language models (LMs) act rationally?<n>We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior.<n>For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling)
arXiv Detail & Related papers (2025-10-23T17:57:28Z)
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare.<n>Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.<n>We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
FairReason: Balancing Reasoning and Social Bias in MLLMs [50.618158642714505]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities.<n>Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z)
Evaluating the Sensitivity of LLMs to Prior Context [2.377922603550519]
Large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios.<n>We introduce a novel set of benchmarks that vary the volume and nature of prior context to measure sensitivity to contextual variations.<n>Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions.
arXiv Detail & Related papers (2025-05-29T16:09:32Z)
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation [0.8087612190556891]
VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub and annotated by security experts.<n>For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weaknession (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan.<n>Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER.<n>Our results show that current state-of-the-
arXiv Detail & Related papers (2025-05-26T01:20:44Z)
Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities.<n>We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z)
Localization Meets Uncertainty: Uncertainty-Aware Multi-Modal Localization [5.414146574747448]
This study introduces a percentile-based rejection strategy that filters out unreliable 3-DoF pose predictions.<n> Experimental results show that applying stricter uncertainty thresholds consistently improves pose accuracy.
arXiv Detail & Related papers (2025-04-10T12:07:24Z)
Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models [63.559461750135334]
Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals.<n>We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures.<n>We find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies.
arXiv Detail & Related papers (2025-03-03T09:16:26Z)
Investigating Non-Transitivity in LLM-as-a-Judge [24.358802214160697]
We investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings.<n>To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments.
arXiv Detail & Related papers (2025-02-19T19:59:16Z)
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z)
LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering. We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
Making Large Language Models Better Planners with Reasoning-Decision Alignment [70.5381163219608]
We motivate an end-to-end decision-making model based on multimodality-augmented LLM. We propose a reasoning-decision alignment constraint between the paired CoTs and planning results. We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver.
arXiv Detail & Related papers (2024-08-25T16:43:47Z)
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs) We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.