AI Playing Business Games: Benchmarking Large Language Models on Managerial Decision-Making in Dynamic Simulations
- URL: http://arxiv.org/abs/2509.26331v1
- Date: Tue, 30 Sep 2025 14:43:05 GMT
- Title: AI Playing Business Games: Benchmarking Large Language Models on Managerial Decision-Making in Dynamic Simulations
- Authors: Berdymyrat Ovezmyradov,
- Abstract summary: This research analyses a novel benchmark using a business game for the decision making in business.<n>The research contributes to the recent literature on AI by proposing a reproducible, open-access management simulator.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid advancement of LLMs sparked significant interest in their potential to augment or automate managerial functions. One of the most recent trends in AI benchmarking is performance of Large Language Models (LLMs) over longer time horizons. While LLMs excel at tasks involving natural language and pattern recognition, their capabilities in multi-step, strategic business decision-making remain largely unexplored. Few studies demonstrated how results can be different from benchmarks in short-term tasks, as Vending-Bench revealed. Meanwhile, there is a shortage of alternative benchmarks for long-term coherence. This research analyses a novel benchmark using a business game for the decision making in business. The research contributes to the recent literature on AI by proposing a reproducible, open-access management simulator to the research community for LLM benchmarking. This novel framework is used for evaluating the performance of five leading LLMs available in free online interface: Gemini, ChatGPT, Meta AI, Mistral AI, and Grok. LLM makes decisions for a simulated retail company. A dynamic, month-by-month management simulation provides transparently in spreadsheet model as experimental environment. In each of twelve months, the LLMs are provided with a structured prompt containing a full business report from the previous period and are tasked with making key strategic decisions: pricing, order size, marketing budget, hiring, dismissal, loans, training expense, R&D expense, sales forecast, income forecast The methodology is designed to compare the LLMs on quantitative metrics: profit, revenue, and market share, and other KPIs. LLM decisions are analyzed in their strategic coherence, adaptability to market changes, and the rationale provided for their decisions. This approach allows to move beyond simple performance metrics for assessment of the long-term decision-making.
Related papers
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks.<n>We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process.<n>AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z) - StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? [44.10622904101254]
Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents.<n>We introduce StockBench, a benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments.<n>Our evaluation shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively.
arXiv Detail & Related papers (2025-10-02T16:54:57Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [66.87201770167012]
MLRC-Bench is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions.<n>Unlike prior work, MLRC-Bench measures the key steps of proposing and implementing novel research methods.<n>Even the best-performing tested agent closes only 9.3% of the gap between baseline and top human participant scores.
arXiv Detail & Related papers (2025-04-13T19:35:43Z) - EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.<n>We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.<n>Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z) - A Reality check of the benefits of LLM in business [1.9181612035055007]
Large language models (LLMs) have achieved remarkable performance in language understanding and generation tasks.
This paper thoroughly examines the usefulness and readiness of LLMs for business processes.
arXiv Detail & Related papers (2024-06-09T02:36:00Z) - The Economic Implications of Large Language Model Selection on Earnings and Return on Investment: A Decision Theoretic Model [0.0]
We use a decision-theoretic approach to compare the financial impact of different language models.
The study reveals how the superior accuracy of more expensive models can, under certain conditions, justify a greater investment.
This article provides a framework for companies looking to optimize their technology choices.
arXiv Detail & Related papers (2024-05-27T20:08:41Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena [25.865825113847404]
We introduce AucArena, a novel evaluation suite that simulates auctions.
We conduct controlled experiments using state-of-the-art Large Language Models (LLMs) to power bidding agents to benchmark their planning and execution skills.
arXiv Detail & Related papers (2023-10-09T14:22:09Z) - Large Language Models in Finance: A Survey [12.243277149505364]
Large language models (LLMs) have opened new possibilities for artificial intelligence applications in finance.
Recent advances in large language models (LLMs) have opened new possibilities for artificial intelligence applications in finance.
arXiv Detail & Related papers (2023-09-28T06:04:04Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.