PredictionMarketBench: A SWE-bench-Style Framework for Backtesting Trading Agents on Prediction Markets
- URL: http://arxiv.org/abs/2602.00133v1
- Date: Wed, 28 Jan 2026 06:41:12 GMT
- Title: PredictionMarketBench: A SWE-bench-Style Framework for Backtesting Trading Agents on Prediction Markets
- Authors: Avi Arora, Ritesh Malpani
- Abstract summary: PredictionMarketBench is a SWE-bench-style benchmark for evaluating algorithmic and LLM-based trading agents on prediction markets. PredictionMarketBench standardizes (i) episode construction from raw exchange streams (orderbooks, trades, lifecycle, settlement), (ii) an execution-realistic simulator with maker/taker semantics and fee modeling, and (iii) a tool-based agent interface. We release four Kalshi-based episodes spanning cryptocurrency, weather, and sports. Baseline results show that naive trading agents can underperform due to transaction costs and settlement losses, while fee-aware algorithmic strategies remain competitive in volatile episodes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prediction markets offer a natural testbed for trading agents: contracts have binary payoffs, prices can be interpreted as probabilities, and realized performance depends critically on market microstructure, fees, and settlement risk. We introduce PredictionMarketBench, a SWE-bench-style benchmark for evaluating algorithmic and LLM-based trading agents on prediction markets via deterministic, event-driven replay of historical limit-order-book and trade data. PredictionMarketBench standardizes (i) episode construction from raw exchange streams (orderbooks, trades, lifecycle, settlement), (ii) an execution-realistic simulator with maker/taker semantics and fee modeling, and (iii) a tool-based agent interface that supports both classical strategies and tool-calling LLM agents with reproducible trajectories. We release four Kalshi-based episodes spanning cryptocurrency, weather, and sports. Baseline results show that naive trading agents can underperform due to transaction costs and settlement losses, while fee-aware algorithmic strategies remain competitive in volatile episodes.
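The abstract's point that fees and settlement drive realized performance can be illustrated with a minimal sketch. It is not the benchmark's simulator: the `Fill` type, the function names, and the flat per-contract fee rates are all hypothetical, chosen only to show how a binary payoff, the entry price, and a maker/taker fee combine into net profit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    """One executed purchase of a binary contract (hypothetical type)."""
    side: str       # "yes" or "no": the outcome this contract pays out on
    price: float    # price paid per contract, in dollars between 0 and 1
    quantity: int   # number of contracts bought
    is_taker: bool  # True if the order crossed the spread (taker)


def fee(fill: Fill, taker_rate: float = 0.01, maker_rate: float = 0.0) -> float:
    """Assumed flat fee per contract; not any real exchange's schedule."""
    rate = taker_rate if fill.is_taker else maker_rate
    return rate * fill.quantity


def settled_pnl(fill: Fill, outcome_yes: bool) -> float:
    """Net profit at settlement: payout minus purchase cost minus fees."""
    wins = (fill.side == "yes") == outcome_yes
    payout = fill.quantity * (1.0 if wins else 0.0)  # $1 per winning contract
    cost = fill.quantity * fill.price
    return payout - cost - fee(fill)


if __name__ == "__main__":
    taker = Fill(side="yes", price=0.60, quantity=10, is_taker=True)
    print(settled_pnl(taker, outcome_yes=True))
    print(settled_pnl(taker, outcome_yes=False))
```

Under these assumptions, a taker who buys at 0.60 needs the true probability to exceed 0.61 to break even in expectation, which is one way a naive agent can lose money even when its forecasts are roughly right.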
Related papers
- TraderBench: How Robust Are AI Agents in Adversarial Capital Markets? [8.661756660747042]
TraderBench is a benchmark for evaluating AI agents in finance. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations. It includes two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management.
arXiv Detail & Related papers (2026-02-27T20:06:28Z)
- Forecasting Future Language: Context Design for Mention Markets [81.25011140991566]
We study how input context should be designed to support accurate prediction in mention markets. We find three insights: (1) richer context consistently improves forecasting performance; (2) market-conditioned prompting (MCP), which treats the market probability as a prior and updates it using textual evidence, yields better-calibrated forecasts; and (3) a mixture of the market probability and MCP (MixMCP) outperforms the market baseline.
arXiv Detail & Related papers (2026-02-04T12:43:31Z)
- Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets [57.179679246370114]
In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. During deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties. We develop a novel class of elliptic uncertainty sets, enabling efficient and tractable robust policy evaluation.
arXiv Detail & Related papers (2025-10-22T18:22:25Z)
- When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents [74.55061622246824]
Agent Market Arena (AMA) is the first lifelong, real-time benchmark for evaluating Large Language Model (LLM)-based trading agents. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework. It evaluates agents across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash.
arXiv Detail & Related papers (2025-10-13T17:54:09Z)
- To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions [0.0]
Large language models (LLMs) are increasingly deployed in agentic frameworks. We develop an agentic system that uses LLMs to iteratively discover differential equations for financial time series. We find that model-informed trading strategies outperform standard LLM-based agents.
arXiv Detail & Related papers (2025-07-11T13:29:32Z)
- Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents [69.58565132975504]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks. We present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading.
arXiv Detail & Related papers (2025-02-25T08:41:01Z)
- When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments [55.19252983108372]
We have developed a multi-agent AI system called StockAgent, driven by LLMs.
The StockAgent allows users to evaluate the impact of different external factors on investor trading.
It avoids the test set leakage issue present in existing trading simulation systems based on AI Agents.
arXiv Detail & Related papers (2024-07-15T06:49:30Z)
- Diffusion Variational Autoencoder for Tackling Stochasticity in Multi-Step Regression Stock Price Prediction [54.21695754082441]
Multi-step stock price prediction over a long-term horizon is crucial for forecasting its volatility.
Current solutions to multi-step stock price prediction are mostly designed for single-step, classification-based predictions.
We combine a deep hierarchical variational-autoencoder (VAE) and diffusion probabilistic techniques to do seq2seq stock prediction.
Our model is shown to outperform state-of-the-art solutions in terms of its prediction accuracy and variance.
arXiv Detail & Related papers (2023-08-18T16:21:15Z)
- Data Cross-Segmentation for Improved Generalization in Reinforcement Learning Based Algorithmic Trading [5.75899596101548]
We propose a Reinforcement Learning (RL) algorithm that trades based on signals from a learned predictive model.
We test our algorithm on 20+ years of equity data from Bursa Malaysia.
arXiv Detail & Related papers (2023-07-18T16:00:02Z)
- Predicting Status of Pre and Post M&A Deals Using Machine Learning and Deep Learning Techniques [0.0]
Risk arbitrage or merger arbitrage is an investment strategy that speculates on the success of M&A deals.
Prediction of the deal status in advance is of great importance for risk arbitrageurs.
We present an ML- and DL-based methodology for the takeover success prediction problem.
arXiv Detail & Related papers (2021-08-05T21:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.