Related papers: TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?

URL: http://arxiv.org/abs/2603.00285v1
Date: Fri, 27 Feb 2026 20:06:28 GMT
Title: TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
Authors: Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, Jing Xiong,
Abstract summary: TraderBench is a benchmark for evaluating AI agents in finance. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations. It features two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management.
Score: 8.661756660747042
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce TraderBench, a benchmark that addresses both issues. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance (Sharpe ratio, returns, and drawdown), eliminating judge variance entirely. The framework features two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management. Trading scenarios can be refreshed with new market data to prevent benchmark contamination. Evaluating 13 models (8B open-source to frontier) on ~50 tasks, we find: (1) 8 of 13 models score ~33 on crypto with <1-point variation across adversarial conditions, exposing fixed non-adaptive strategies; (2) extended thinking helps retrieval (+26 points) but has zero impact on trading (+0.3 crypto, -0.1 options). These findings reveal that current agents lack genuine market adaptation, underscoring the need for performance-grounded evaluation in finance.
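
The abstract names three realized-performance metrics (Sharpe ratio, returns, drawdown) but does not say how TraderBench computes or weights them. The following is only a minimal sketch of the textbook definitions, assuming a list of simple per-period returns; the function names, the zero risk-free rate, and the 252-period annualization factor are illustrative assumptions, not details from the paper.

```python
import math


def total_return(returns):
    """Compound simple per-period returns into one cumulative return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio using the sample standard deviation.

    risk_free is a per-period rate; periods_per_year=252 is an assumed
    annualization factor (daily bars), not something stated in the paper.
    """
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        return 0.0  # not enough observations to estimate volatility
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if variance == 0:
        return 0.0  # flat series: the ratio is undefined, report 0 by convention
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the equity curve, as a positive fraction."""
    equity = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Because all three numbers are deterministic functions of the return series, two runs that produce the same trades receive the same score, which is the property the abstract contrasts with LLM-based judging.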

Related papers

PredictionMarketBench: A SWE-bench-Style Framework for Backtesting Trading Agents on Prediction Markets [0.0]
PredictionMarketBench is a SWE-bench-style benchmark for evaluating algorithmic and LLM-based trading agents on prediction markets. PredictionMarketBench standardizes (i) episode construction from raw exchange streams (orderbooks, trades, lifecycle, settlement), (ii) an execution-realistic simulator with maker/taker semantics and fee modeling, and (iii) a tool-based agent interface. We release four Kalshi-based episodes spanning cryptocurrency, weather, and sports. Baseline results show that naive trading agents can underperform due to transaction costs and settlement losses, while fee-aware algorithmic strategies remain competitive in volatile episodes.
arXiv Detail & Related papers (2026-01-28T06:41:12Z)
TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful? [44.01987401527335]
TradeTrap is a unified evaluation framework for systematically stress-testing both adaptive and procedural autonomous trading agents. It targets four core components of autonomous trading agents: market intelligence, strategy formulation, portfolio and ledger handling, and trade execution. Experiments show that small perturbations at a single component can propagate through the agent decision loop and induce extreme concentration, runaway exposure, and large portfolio drawdowns.
arXiv Detail & Related papers (2025-12-01T23:06:42Z)
LiveTradeBench: Seeking Real-World Alpha with Large Language Models [26.976122048323873]
Large language models (LLMs) achieve strong performance across benchmarks. These tests occur in static settings, lacking real dynamics and uncertainty. We introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets.
arXiv Detail & Related papers (2025-11-05T16:47:26Z)
Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets [57.179679246370114]
In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. During deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties. We develop a novel class of elliptic uncertainty sets, enabling efficient and tractable robust policy evaluation.
arXiv Detail & Related papers (2025-10-22T18:22:25Z)
When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents [74.55061622246824]
Agent Market Arena (AMA) is the first lifelong, real-time benchmark for evaluating Large Language Model (LLM)-based trading agents. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework. It evaluates agents across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash.
arXiv Detail & Related papers (2025-10-13T17:54:09Z)
Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading [57.28635022507172]
TiMi is a rationality-driven multi-agent system that architecturally decouples strategy development from minute-level deployment. We propose a two-tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection.
arXiv Detail & Related papers (2025-10-06T13:08:55Z)
Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning [19.52468210547666]
Trading-R1 is a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. The system generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions.
arXiv Detail & Related papers (2025-09-14T20:13:41Z)
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents [69.58565132975504]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks. We present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading.
arXiv Detail & Related papers (2025-02-25T08:41:01Z)
When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments [55.19252983108372]
We have developed a multi-agent AI system called StockAgent, driven by LLMs. The StockAgent allows users to evaluate the impact of different external factors on investor trading. It avoids the test set leakage issue present in existing trading simulation systems based on AI Agents.
arXiv Detail & Related papers (2024-07-15T06:49:30Z)
MacroHFT: Memory Augmented Context-aware Reinforcement Learning On High Frequency Trading [20.3106468936159]
Reinforcement learning (RL) has become another appealing approach for high-frequency trading (HFT). We propose a novel Memory Augmented Context-aware Reinforcement learning method On HFT, a.k.a. MacroHFT. We show that MacroHFT can achieve state-of-the-art performance on minute-level trading tasks.
arXiv Detail & Related papers (2024-06-20T17:48:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.