How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use
- URL: http://arxiv.org/abs/2602.00528v1
- Date: Sat, 31 Jan 2026 05:45:25 GMT
- Title: How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use
- Authors: Minhua Lin, Enyan Dai, Hui Liu, Xianfeng Tang, Yuliang Yan, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Fali Wang, Hongcheng Gao, Chen Luo, Xiang Zhang, Qi He, Suhang Wang
- Abstract summary: Large Language Models (LLMs) are increasingly applied in high-stakes domains. LLMs fail to compete against traditional algorithms. We propose ToolPoker, a tool-integrated reasoning framework.
- Score: 52.394999779049606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals that LLMs fail to compete against traditional algorithms, and it identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a "knowing-doing" gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise, professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
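The abstract describes ToolPoker only at a high level. As a minimal sketch of the general pattern it names, an LLM that defers action selection to an external game-theory-optimal (GTO) solver and grounds its explanation in the solver's output, the outline below may help. All names here (`PokerState`, `query_gto_solver`, the prompt format) are hypothetical stand-ins, not the paper's actual interface.

```python
import random
from dataclasses import dataclass

# Hypothetical game-state container; the field names are illustrative,
# not taken from the ToolPoker paper.
@dataclass
class PokerState:
    hole_cards: str          # e.g. "AhKs"
    board: str               # e.g. "Qs7d2c"
    pot: float
    to_call: float
    legal_actions: list[str]

def query_gto_solver(state: PokerState) -> dict[str, float]:
    """Placeholder for the external solver call (e.g. a CFR-based engine).
    Returns an action -> probability map approximating the GTO strategy."""
    raise NotImplementedError("plug in a real solver backend here")

def tool_integrated_decision(state: PokerState, llm) -> tuple[str, str]:
    """Tool-integrated reasoning in the spirit of ToolPoker: the solver
    supplies a GTO-consistent mixed strategy, the action is sampled from
    it, and the LLM is asked only to explain the choice, closing the
    'knowing-doing' gap by construction rather than by fine-tuning."""
    strategy = query_gto_solver(state)
    actions, probs = zip(*strategy.items())
    action = random.choices(actions, weights=probs, k=1)[0]  # play the mix
    prompt = (
        f"Hand {state.hole_cards}, board {state.board}, "
        f"pot {state.pot}, {state.to_call} to call.\n"
        f"Solver strategy: {strategy}. Chosen action: {action}.\n"
        "Explain this action in game-theoretic terms (ranges, pot odds)."
    )
    explanation = llm(prompt)  # `llm` is any prompt -> text callable
    return action, explanation
```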
Related papers
- LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess [30.797553771114746]
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs). We rank over 50 open- and closed-source models by play against a random opponent, using a range of behavioral metrics including move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for easily interpretable comparisons between models (a sketch of this estimation procedure appears after this list).
arXiv Detail & Related papers (2025-12-01T18:51:08Z)
- Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competition. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
- Mastering Da Vinci Code: A Comparative Study of Transformer, LLM, and PPO-based Agents [0.0]
The Da Vinci Code, a game of logical deduction and imperfect information, presents unique challenges for artificial intelligence. This paper investigates the efficacy of various AI paradigms in mastering this game.
arXiv Detail & Related papers (2025-06-15T10:33:30Z)
- GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
- TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs [45.12542636218608]
We propose TMGBench, characterized by comprehensive game-type coverage, diverse scenarios, and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our benchmark (a sketch of this enumeration appears after this list). To provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units.
arXiv Detail & Related papers (2024-10-14T13:15:34Z)
- LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games [64.11746320061965]
We study reinforcement learning for text-based games, which are interactive simulations in the context of natural language.
We aim to conduct explicit reasoning with knowledge graphs for decision making, so that the actions of an agent are generated and supported by an interpretable inference procedure.
We extensively evaluate our method on a number of man-made benchmark games, and the experimental results demonstrate that our method performs better than existing text-based agents.
arXiv Detail & Related papers (2020-10-22T12:40:22Z)
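Two of the entries above describe procedures concrete enough to sketch. First, the LLM CHESS Elo estimate: the abstract does not state the estimator, but a standard choice (assumed here, not confirmed by the paper) is to invert the logistic Elo expectation E = 1 / (1 + 10^((R_opp - R)/400)) so that the predicted total score against engines of known rating matches the observed total.

```python
def elo_expected(r_player: float, r_opp: float) -> float:
    """Expected score under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_player) / 400.0))

def estimate_elo(results: list[tuple[float, float]],
                 lo: float = 0.0, hi: float = 4000.0) -> float:
    """Estimate a rating from (opponent_rating, score) pairs, where score
    is 1 for a win, 0.5 for a draw, and 0 for a loss.

    Solves sum(expected) == sum(observed) by bisection; the left side is
    monotone increasing in the player's rating, so the root is unique.
    This is a generic performance-rating estimator, not necessarily the
    exact procedure used by LLM CHESS."""
    observed = sum(score for _, score in results)
    for _ in range(100):  # bisect to well under one rating point
        mid = (lo + hi) / 2.0
        if sum(elo_expected(mid, r) for r, _ in results) < observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: 6 wins, 1 draw, 3 losses against a 1500-rated engine.
games = [(1500.0, 1.0)] * 6 + [(1500.0, 0.5)] + [(1500.0, 0.0)] * 3
print(round(estimate_elo(games)))  # about 1608: 1500 + 400*log10(0.65/0.35)
```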
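Second, the Robinson-Goforth count of 144 game types cited by TMGBench can be checked directly: with strict ordinal payoffs, each player has 4! = 24 rankings of the four outcomes, giving 24 x 24 = 576 payoff tables, and identifying tables that differ only by relabeling each player's two strategies (a group of 4 relabelings acting freely) leaves 576 / 4 = 144 classes. The enumeration below is a reconstruction of that standard argument, not code from the paper.

```python
from itertools import permutations

# A strict ordinal 2x2 game: each player ranks the outcomes
# (TL, TR, BL, BR) with payoffs 1..4 and no ties, so there are
# 24 x 24 = 576 raw payoff tables. Robinson and Goforth count games
# up to swapping either player's two strategies (pure relabeling);
# player interchange is NOT factored out, which is what separates
# their 144 types from the older count of 78.

def relabelings(game):
    """All four strategy-relabelings of ((row payoffs), (col payoffs)),
    each payoff tuple ordered as (TL, TR, BL, BR)."""
    def row_swap(p):  # swap top/bottom rows: TL<->BL, TR<->BR
        return (p[2], p[3], p[0], p[1])
    def col_swap(p):  # swap left/right columns: TL<->TR, BL<->BR
        return (p[1], p[0], p[3], p[2])
    row, col = game
    variants = set()
    for fr in (lambda p: p, row_swap):
        for fc in (lambda p: p, col_swap):
            # The same relabeling must act on both players' payoffs.
            variants.add((fr(fc(row)), fr(fc(col))))
    return variants

def canonical(game):
    return min(relabelings(game))  # lexicographically smallest variant

orders = list(permutations((1, 2, 3, 4)))
classes = {canonical((r, c)) for r in orders for c in orders}
print(len(classes))  # prints 144
```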
This list is automatically generated from the titles and abstracts of papers on this site.