Evaluating LLM Understanding via Structured Tabular Decision Simulations
- URL: http://arxiv.org/abs/2511.10667v1
- Date: Fri, 07 Nov 2025 09:42:39 GMT
- Title: Evaluating LLM Understanding via Structured Tabular Decision Simulations
- Authors: Sichao Li, Xinyue Xu, Xiaomeng Li
- Abstract summary: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings. We analyze 9 frontier LLMs across 15 diverse decision settings.
- Score: 19.626373589153108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision "exams". In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs' understanding ability.
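The abstract's notion of global faithfulness, checking whether a model's stated rationale matches the factors that actually drive its predictions, can be sketched in code. This is an illustrative approximation, not the paper's protocol: `predict` stands in for an LLM wrapper, `stated_factors` for the factors the model claims to rely on, and driving factors are estimated via simple permutation importance.

```python
import random

def permutation_importance(predict, rows, feature_names, seed=0):
    """Estimate how much each feature drives predictions: the fraction of
    predictions that flip when that feature's values are shuffled across rows."""
    rng = random.Random(seed)
    base = [predict(r) for r in rows]
    importance = {}
    for f in feature_names:
        shuffled = [r[f] for r in rows]
        rng.shuffle(shuffled)
        perturbed = [{**r, f: v} for r, v in zip(rows, shuffled)]
        flipped = sum(p != b for p, b in zip((predict(r) for r in perturbed), base))
        importance[f] = flipped / len(rows)
    return importance

def faithfulness_overlap(stated_factors, importance, k=3):
    """Overlap between the model's top-k *stated* factors and the top-k
    factors that empirically *drive* its predictions (1.0 = fully faithful)."""
    driving = sorted(importance, key=importance.get, reverse=True)[:k]
    return len(set(stated_factors[:k]) & set(driving)) / k
```

A model can score well on prediction yet poorly on this overlap, which is the accurate-but-unfaithful pattern the abstract reports.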
Related papers
- STRUX: An LLM for Decision-Making with Structured Explanations [17.518955158367305]
We introduce a new framework called STRUX, which enhances LLM decision-making by providing structured explanations.
STRUX begins by distilling lengthy information into a concise table of key facts.
It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision.
arXiv Detail & Related papers (2024-10-16T14:01:22Z)
- DeFine: Decision-Making with Analogical Reasoning over Factor Profiles [35.9909472797192]
DeFine is a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning to guide LLMs in making critical decisions in new situations. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.
arXiv Detail & Related papers (2024-10-02T17:29:34Z)
- Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance and improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Determinants of LLM-assisted Decision-Making [0.0]
Large Language Models (LLMs) provide multifaceted support in enhancing human decision-making processes.
This study provides a structural overview and detailed analysis of determinants impacting decision-making with LLM support.
Our findings can help improve decision quality in human-AI collaboration.
arXiv Detail & Related papers (2024-02-27T10:24:50Z)
- DeLLMa: Decision Making Under Uncertainty with Large Language Models [31.77731889916652]
DeLLMa is a framework designed to enhance decision-making accuracy in uncertain environments.
We show that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods.
arXiv Detail & Related papers (2024-02-04T08:11:45Z)
- Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
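The expert-consistency idea above, combining historical expert decisions with observed outcomes to narrow the construct gap, can be sketched as a simple blending rule. This is an illustrative assumption, not the paper's influence-function methodology: here a per-case consistency score (how likely most experts would agree with the recorded decision) is taken as given rather than estimated.

```python
def blended_target(outcomes, expert_decisions, consistency):
    """Return a per-case training target in [0, 1].

    consistency[i] estimates how likely it is that most experts would agree
    with the recorded decision on case i: when experts are highly consistent,
    lean on the expert decision; otherwise, lean on the observed outcome.
    """
    return [
        c * e + (1.0 - c) * y
        for y, e, c in zip(outcomes, expert_decisions, consistency)
    ]
```

The key design point is that the expert signal is trusted only in proportion to how reliable it is estimated to be, rather than being used uniformly.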
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.