AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
- URL: http://arxiv.org/abs/2602.18481v1
- Date: Tue, 10 Feb 2026 14:29:33 GMT
- Title: AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
- Authors: Wentao Zhang, Mingxuan Zhao, Jincheng Gao, Jieshun You, Huaiyu Jia, Yilei Zhao, Bo An, Shuo Sun
- Abstract summary: Current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We propose AlphaForgeBench, a principled framework that reframes Large Language Models (LLMs) as quantitative researchers rather than execution agents.
- Score: 23.493646150407116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge tests to interactive trading simulations. However, current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We empirically show that LLM-based trading agents exhibit extreme run-to-run variance, inconsistent action sequences even under deterministic decoding, and irrational action flipping across adjacent time steps. These issues stem from stateless autoregressive architectures lacking persistent action memory, as well as sensitivity to continuous-to-discrete action mappings in portfolio allocation. As a result, many existing financial trading benchmarks produce unreliable, non-reproducible, and uninformative evaluations. To address these limitations, we propose AlphaForgeBench, a principled framework that reframes LLMs as quantitative researchers rather than execution agents. Instead of emitting trading actions, LLMs generate executable alpha factors and factor-based strategies grounded in financial reasoning. This design decouples reasoning from execution, enabling fully deterministic and reproducible evaluation while aligning with real-world quantitative research workflows. Experiments across multiple state-of-the-art LLMs show that AlphaForgeBench eliminates execution-induced instability and provides a rigorous benchmark for assessing financial reasoning, strategy formulation, and alpha discovery.
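The decoupling the abstract describes can be made concrete with a minimal sketch. The factor definition, threshold mapping, and price series below are hypothetical illustrations, not artifacts from the paper; they show the key property the benchmark relies on: an LLM-authored alpha factor is a pure, deterministic function of market data, so evaluating it requires no stochastic decoding at execution time and is fully reproducible.

```python
# Hypothetical sketch of an "executable alpha factor" in the spirit of
# AlphaForgeBench: the LLM authors the factor code once, and execution is
# a deterministic backtest of that code (no per-step LLM calls).

def momentum_factor(prices, lookback=5):
    """Alpha factor: trailing return over `lookback` periods (0.0 during warm-up)."""
    factors = []
    for t in range(len(prices)):
        if t < lookback:
            factors.append(0.0)
        else:
            factors.append(prices[t] / prices[t - lookback] - 1.0)
    return factors

def factor_to_positions(factors, threshold=0.0):
    """Deterministic mapping from factor values to {-1, 0, +1} positions."""
    return [1 if f > threshold else (-1 if f < -threshold else 0) for f in factors]

# Illustrative price series; in practice this would be historical market data.
prices = [100, 101, 103, 102, 105, 107, 106, 110]
positions = factor_to_positions(momentum_factor(prices))
```

Because the trading actions are derived from factor code rather than sampled from the model at each time step, two runs of this evaluation on the same data always produce identical position sequences, which is exactly the reproducibility property the paper argues agent-style trading benchmarks lack.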
Related papers
- Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation [37.95724732592611]
We use a financial stock market scenario to test whether agents' strategy switching aligns with financial theory. We operationalize four behavioral-finance drivers (loss aversion, herding, wealth differentiation, and price misalignment) as personality traits set via prompting and stored long-term. Our results show that recent LLMs' switching behavior is only partially consistent with behavioral-finance theories.
arXiv Detail & Related papers (2026-02-02T09:25:10Z) - ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning [2.1461777157838724]
We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in large language model (LLM) reasoning. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. We further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
arXiv Detail & Related papers (2025-12-08T18:26:58Z) - LiveTradeBench: Seeking Real-World Alpha with Large Language Models [26.976122048323873]
Large language models (LLMs) achieve strong performance across benchmarks. However, these tests occur in static settings, lacking real dynamics and uncertainty. We introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets.
arXiv Detail & Related papers (2025-11-05T16:47:26Z) - Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets [57.179679246370114]
In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. During deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties. We develop a novel class of elliptic uncertainty sets, enabling efficient and tractable robust policy evaluation.
arXiv Detail & Related papers (2025-10-22T18:22:25Z) - StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? [44.10622904101254]
Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents. We introduce StockBench, a benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Our evaluation shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively.
arXiv Detail & Related papers (2025-10-02T16:54:57Z) - Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation [55.2788567621326]
We introduce a novel benchmark, FIN-FORCE (FINancial FORward Counterfactual Evaluation). By curating financial news headlines, FIN-FORCE supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments.
arXiv Detail & Related papers (2025-05-26T02:41:50Z) - TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning [27.449948943467163]
We propose a Token-level Uncertainty estimation framework for Reasoning (TokUR). TokUR enables Large Language Models to self-assess and self-improve their responses in mathematical reasoning. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness.
arXiv Detail & Related papers (2025-05-16T22:47:32Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.