Evaluating LLM Understanding via Structured Tabular Decision Simulations
- URL: http://arxiv.org/abs/2511.10667v1
- Date: Fri, 07 Nov 2025 09:42:39 GMT
- Title: Evaluating LLM Understanding via Structured Tabular Decision Simulations
- Authors: Sichao Li, Xinyue Xu, Xiaomeng Li
- Abstract summary: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings. We analyze 9 frontier LLMs across 15 diverse decision settings.
- Score: 19.626373589153108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision "exams". In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs' understanding ability.
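The abstract's notion of global faithfulness, checking whether a model's stated rationale matches the factors that actually drive its predictions, can be sketched in code. This is an illustrative approximation, not the paper's protocol: `predict` stands in for an LLM wrapper, `stated_factors` for the factors the model claims to rely on, and driving factors are estimated via simple permutation importance.

```python
import random

def permutation_importance(predict, rows, feature_names, seed=0):
    """Estimate how much each feature drives predictions: the fraction of
    predictions that flip when that feature's values are shuffled across rows."""
    rng = random.Random(seed)
    base = [predict(r) for r in rows]
    importance = {}
    for f in feature_names:
        shuffled = [r[f] for r in rows]
        rng.shuffle(shuffled)
        perturbed = [{**r, f: v} for r, v in zip(rows, shuffled)]
        flipped = sum(p != b for p, b in zip((predict(r) for r in perturbed), base))
        importance[f] = flipped / len(rows)
    return importance

def faithfulness_overlap(stated_factors, importance, k=3):
    """Overlap between the model's top-k *stated* factors and the top-k
    factors that empirically *drive* its predictions (1.0 = fully faithful)."""
    driving = sorted(importance, key=importance.get, reverse=True)[:k]
    return len(set(stated_factors[:k]) & set(driving)) / k
```

A model can score well on prediction yet poorly on this overlap, which is the accurate-but-unfaithful pattern the abstract reports.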
Related papers
- STRUX: An LLM for Decision-Making with Structured Explanations [17.518955158367305]
We introduce a new framework called STRUX, which enhances LLM decision-making by providing structured explanations.
STRUX begins by distilling lengthy information into a concise table of key facts.
It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision.
arXiv Detail & Related papers (2024-10-16T14:01:22Z)
- DeFine: Decision-Making with Analogical Reasoning over Factor Profiles [35.9909472797192]
DeFine is a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning to guide LLMs in making critical decisions in new situations. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.
arXiv Detail & Related papers (2024-10-02T17:29:34Z)
- Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance and improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Determinants of LLM-assisted Decision-Making [0.0]
Large Language Models (LLMs) provide multifaceted support in enhancing human decision-making processes.
This study provides a structural overview and detailed analysis of determinants impacting decision-making with LLM support.
Our findings can help improve decision quality in human-AI collaboration.
arXiv Detail & Related papers (2024-02-27T10:24:50Z)
- DeLLMa: Decision Making Under Uncertainty with Large Language Models [31.77731889916652]
DeLLMa is a framework designed to enhance decision-making accuracy in uncertain environments.
We show that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods.
arXiv Detail & Related papers (2024-02-04T08:11:45Z)
- Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
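The expert-consistency idea above, combining historical expert decisions with observed outcomes to narrow the construct gap, can be sketched as a simple blending rule. This is an illustrative assumption, not the paper's influence-function methodology: here a per-case consistency score (how likely most experts would agree with the recorded decision) is taken as given rather than estimated.

```python
def blended_target(outcomes, expert_decisions, consistency):
    """Return a per-case training target in [0, 1].

    consistency[i] estimates how likely it is that most experts would agree
    with the recorded decision on case i: when experts are highly consistent,
    lean on the expert decision; otherwise, lean on the observed outcome.
    """
    return [
        c * e + (1.0 - c) * y
        for y, e, c in zip(outcomes, expert_decisions, consistency)
    ]
```

The key design point is that the expert signal is trusted only in proportion to how reliable it is estimated to be, rather than being used uniformly.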
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.