AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager
- URL: http://arxiv.org/abs/2508.11416v1
- Date: Fri, 15 Aug 2025 11:38:19 GMT
- Title: AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager
- Authors: Xuhua Zhao, Yuxuan Xie, Caihua Chen, Yuxiang Sun
- Abstract summary: AIM-Bench is a novel benchmark designed to assess the decision-making behaviour of large language models (LLMs) in uncertain supply chain management scenarios. Our results reveal that different LLMs typically exhibit varying degrees of decision bias that are similar to those observed in human beings.
- Score: 9.21215885702746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in mathematical reasoning and the long-term planning capabilities of large language models (LLMs) have precipitated the development of agents, which are increasingly leveraged in business operations processes. Decision models that optimize inventory levels are a core element of operations management. However, the capabilities of LLM agents in making inventory decisions in uncertain contexts, as well as their decision-making biases (e.g., the framing effect), remain largely unexplored. This raises concerns about whether LLM agents can effectively address real-world problems, and about the potential implications of any biases they exhibit. To address this gap, we introduce AIM-Bench, a novel benchmark designed to assess the decision-making behaviour of LLM agents in uncertain supply chain management scenarios through a diverse series of inventory replenishment experiments. Our results reveal that different LLMs typically exhibit varying degrees of decision bias similar to those observed in human beings. In addition, we explore strategies to mitigate the pull-to-centre effect and the bullwhip effect, namely cognitive reflection and information sharing. These findings underscore the need for careful consideration of potential biases when deploying LLMs in inventory decision-making scenarios. We hope these insights will pave the way for mitigating human decision bias and developing human-centred decision support systems for supply chains.
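The abstract does not detail the replenishment protocol, but pull-to-centre experiments of this kind are conventionally built on the newsvendor problem, where the profit-maximizing order is the critical-fractile quantile of the demand distribution and the bias appears as orders drifting from that optimum toward mean demand. Below is a minimal sketch of that setup, assuming normally distributed demand; the function names and the bias index are illustrative, not AIM-Bench's actual interface:

```python
from scipy.stats import norm

def newsvendor_optimal(mu: float, sigma: float, price: float, cost: float,
                       salvage: float = 0.0) -> float:
    """Critical-fractile order quantity for Normal(mu, sigma) demand.

    Underage cost c_u = price - cost (margin lost per unit of unmet demand);
    overage cost c_o = cost - salvage (loss per leftover unit).
    Optimal order: q* = F^{-1}(c_u / (c_u + c_o)).
    """
    c_u, c_o = price - cost, cost - salvage
    return norm.ppf(c_u / (c_u + c_o), loc=mu, scale=sigma)

def pull_to_centre_index(order: float, mu: float, q_star: float) -> float:
    """Fraction of the gap between q* and mean demand covered by the order:
    0 means unbiased (order == q*), 1 means fully pulled to mean demand."""
    return (q_star - order) / (q_star - mu)

# High-margin item: q* lies above mean demand, so pull-to-centre bias
# shows up as systematic under-ordering relative to q*.
q_star = newsvendor_optimal(mu=100, sigma=30, price=12, cost=3)
print(f"q* = {q_star:.1f}")                                    # ~120.2
print(pull_to_centre_index(order=110, mu=100, q_star=q_star))  # ~0.5
```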
Related papers
- Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making [28.256934953904317]
We study whether large language models (LLMs) are rational utility maximizers with coherent beliefs and stable preferences. Our approach provides falsifiable conditions under which the reported probabilities cannot correspond to the true beliefs of any rational agent.
arXiv Detail & Related papers (2026-02-06T00:50:33Z)
- AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval [3.3703751888858675]
This study investigates large language model (LLM)-based multi-agent systems (MASs) as a promising approach to inventory management. It is unclear whether LLM-based MASs can consistently derive optimal ordering policies and adapt to diverse supply chain scenarios.
arXiv Detail & Related papers (2026-02-05T10:35:00Z)
- Feedback-Induced Performance Decline in LLM-Based Decision-Making [6.5990946334144756]
Large Language Models (LLMs) can extract context from natural language problem descriptions. This paper studies the behaviour of these models within Markov Decision Processes (MDPs).
arXiv Detail & Related papers (2025-07-20T10:38:56Z)
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance and improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context [5.361970694197912]
This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of large language models (LLMs).
We estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro.
Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities (a standard formalization of these constructs is sketched after this list).
arXiv Detail & Related papers (2024-06-10T02:14:19Z)
- Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from an intervention, as distinct from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
- Rational Decision-Making Agent with Internalized Utility Judgment [88.01612847081677]
Large language models (LLMs) have demonstrated remarkable advancements and have attracted significant efforts to develop LLMs into agents capable of executing intricate multi-step decision-making tasks beyond traditional NLP applications. This paper proposes RadAgent, which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Experimental results on the ToolBench dataset demonstrate RadAgent's superiority over baselines, achieving over 10% improvement in Pass Rate on diverse tasks.
arXiv Detail & Related papers (2023-08-24T03:11:45Z)
- Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies [79.60322329952453]
We show how to develop interpretable representations of how agents make decisions.
By understanding the decision-making processes underlying a set of observed trajectories, we cast the policy inference problem as the inverse to this online learning problem.
We introduce a practical algorithm for retrospectively estimating such perceived effects, alongside the process through which agents update them.
Through application to the analysis of UNOS organ donation acceptance decisions, we demonstrate that our approach can bring valuable insights into the factors that govern decision processes and how they change over time.
arXiv Detail & Related papers (2022-03-14T17:40:42Z)
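For reference, the risk-preference, probability-weighting, and loss-aversion constructs measured in the behavioural-economics entry above are conventionally formalized with cumulative prospect theory. The functional forms below are Tversky and Kahneman's (1992) standard parameterization, not anything specific to that paper:

```latex
% Prospect-theory value function v and probability-weighting function w
% (Tversky & Kahneman, 1992). Loss aversion corresponds to \lambda > 1;
% overweighting of small probabilities to w(p) > p for small p when \gamma < 1.
% Typical human estimates: \alpha \approx \beta \approx 0.88,
% \lambda \approx 2.25, \gamma \approx 0.61.
v(x) =
\begin{cases}
  x^{\alpha},             & x \ge 0,\\
  -\lambda\,(-x)^{\beta}, & x < 0,
\end{cases}
\qquad
w(p) = \frac{p^{\gamma}}{\bigl(p^{\gamma} + (1-p)^{\gamma}\bigr)^{1/\gamma}}
```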