StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
- URL: http://arxiv.org/abs/2510.02209v1
- Date: Thu, 02 Oct 2025 16:54:57 GMT
- Title: StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
- Authors: Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li
- Abstract summary: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents. We introduce StockBench, a benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Our evaluation shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively.
- Score: 44.10622904101254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
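The abstract names cumulative return, maximum drawdown, and the Sortino ratio as evaluation metrics but does not give their formulas. The following is a minimal sketch using common textbook definitions, assuming a list of daily portfolio values as input. The function names, the zero target return, and the lack of annualization are illustrative assumptions and may differ from what StockBench actually implements.

```python
import math


def daily_returns(values):
    """Simple day-over-day returns from a series of portfolio values."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def cumulative_return(values):
    """Total return over the whole period, e.g. 0.2 means +20%."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1.0)
    return worst


def sortino_ratio(values, target=0.0):
    """Mean excess return divided by downside deviation.

    Assumption: daily frequency, no annualization, and downside deviation
    computed over all periods (not only the losing ones).
    """
    excess = [r - target for r in daily_returns(values)]
    mean_excess = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        # No returns below the target: the ratio is undefined; report
        # infinity for a positive mean and zero otherwise.
        return math.inf if mean_excess > 0 else 0.0
    return mean_excess / downside


# Hypothetical portfolio values, for illustration only.
agent_values = [100.0, 110.0, 99.0, 120.0]
print(cumulative_return(agent_values))
print(max_drawdown(agent_values))
print(sortino_ratio(agent_values))
```

Applying the same three functions to the value series of a buy-and-hold portfolio over the same dates would give the kind of baseline comparison the abstract describes.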
Related papers
- AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models [23.493646150407116]
Current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We propose AlphaForgeBench, a principled framework that reframes Large Language Models (LLMs) as quantitative researchers rather than execution agents.
arXiv Detail & Related papers (2026-02-10T14:29:33Z)
- Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation [37.95724732592611]
We use a financial stock market scenario to test whether agents' strategy switching aligns with financial theory. We operationalize four behavioral-finance drivers (loss aversion, herding, wealth differentiation, and price misalignment) as personality traits set via prompting and stored long-term. Our results show that recent LLMs' switching behavior is only partially consistent with behavioral-finance theories.
arXiv Detail & Related papers (2026-02-02T09:25:10Z)
- LiveTradeBench: Seeking Real-World Alpha with Large Language Models [26.976122048323873]
Large language models (LLMs) achieve strong performance across benchmarks. These tests occur in static settings, lacking real dynamics and uncertainty. We introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets.
arXiv Detail & Related papers (2025-11-05T16:47:26Z)
- When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents [74.55061622246824]
Agent Market Arena (AMA) is the first lifelong, real-time benchmark for evaluating Large Language Model (LLM)-based trading agents. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework. It evaluates agents across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash.
arXiv Detail & Related papers (2025-10-13T17:54:09Z)
- Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading [57.28635022507172]
TiMi is a rationality-driven multi-agent system that architecturally decouples strategy development from minute-level deployment. We propose a two-tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection.
arXiv Detail & Related papers (2025-10-06T13:08:55Z)
- Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation [55.2788567621326]
We introduce a novel benchmark, FIN-FORCE (FINancial FORward Counterfactual Evaluation). By curating financial news headlines, FIN-FORCE supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments.
arXiv Detail & Related papers (2025-05-26T02:41:50Z)
- Can LLM-based Financial Investing Strategies Outperform the Market in Long Run? [5.968528974532717]
Large Language Models (LLMs) have been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols.
arXiv Detail & Related papers (2025-05-11T18:02:21Z)
- Will LLMs be Professional at Fund Investment? DeepFund: A Live Arena Perspective [10.932591941137698]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, but their effectiveness in financial decision-making remains inadequately evaluated. We introduce DeepFund, a comprehensive arena platform for evaluating LLM-based trading strategies in a live environment. Our approach implements a multi-agent framework where they serve as multiple key roles that realize the real-world investment decision processes.
arXiv Detail & Related papers (2025-03-24T03:32:13Z)
- Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents [69.58565132975504]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks. We present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading.
arXiv Detail & Related papers (2025-02-25T08:41:01Z)
- When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments [55.19252983108372]
We have developed a multi-agent AI system called StockAgent, driven by LLMs.
The StockAgent allows users to evaluate the impact of different external factors on investor trading.
It avoids the test set leakage issue present in existing trading simulation systems based on AI Agents.
arXiv Detail & Related papers (2024-07-15T06:49:30Z)
- FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
- Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models [51.3422222472898]
We document the capability of large language models (LLMs) like ChatGPT to predict stock price movements using news headlines.
We develop a theoretical model incorporating information capacity constraints, underreaction, limits-to-arbitrage, and LLMs.
arXiv Detail & Related papers (2023-04-15T19:22:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.