Interpretable Hypothesis-Driven Trading: A Rigorous Walk-Forward Validation Framework for Market Microstructure Signals
- URL: http://arxiv.org/abs/2512.12924v1
- Date: Mon, 15 Dec 2025 02:20:42 GMT
- Title: Interpretable Hypothesis-Driven Trading: A Rigorous Walk-Forward Validation Framework for Market Microstructure Signals
- Authors: Gagan Deep, Akash Deep, William Lamptey
- Abstract summary: We develop a walk-forward validation framework for algorithmic trading designed to mitigate overfitting and lookahead bias. Our methodology combines interpretable hypothesis-driven signal generation with reinforcement learning and strict out-of-sample testing. The framework enforces strict information set discipline, employs rolling window validation across 34 independent test periods, and maintains complete interpretability through natural language hypothesis explanations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a rigorous walk-forward validation framework for algorithmic trading designed to mitigate overfitting and lookahead bias. Our methodology combines interpretable hypothesis-driven signal generation with reinforcement learning and strict out-of-sample testing. The framework enforces strict information set discipline, employs rolling window validation across 34 independent test periods, maintains complete interpretability through natural language hypothesis explanations, and incorporates realistic transaction costs and position constraints. Validating five market microstructure patterns across 100 US equities from 2015 to 2024, the system yields modest annualized returns (0.55%, Sharpe ratio 0.33) with exceptional downside protection (maximum drawdown -2.76%) and market-neutral characteristics (beta = 0.058). Performance exhibits strong regime dependence, generating positive returns during high-volatility periods (0.60% quarterly, 2020-2024) while underperforming in stable markets (-0.16%, 2015-2019). We report statistically insignificant aggregate results (p-value 0.34) to demonstrate a reproducible, honest validation protocol that prioritizes interpretability and extends naturally to advanced hypothesis generators, including large language models. The key empirical finding reveals that daily OHLCV-based microstructure signals require elevated information arrival and trading activity to function effectively. The framework provides complete mathematical specifications and open-source implementation, establishing a template for rigorous trading system evaluation that addresses the reproducibility crisis in quantitative finance research. For researchers, practitioners, and regulators, this work demonstrates that interpretable algorithmic trading strategies can be rigorously validated without sacrificing transparency or regulatory compliance.
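The rolling-window protocol and headline metrics from the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the window lengths below are assumptions chosen so that 40 quarters of data (2015 to 2024) yield the 34 out-of-sample test periods the paper reports, and the Sharpe ratio and maximum drawdown are computed with their standard definitions.

```python
import numpy as np

def walk_forward_splits(n_periods, train_window, test_window=1):
    """Yield (train, test) index arrays for rolling-window validation.

    Every test window is strictly out-of-sample: a model fitted on the
    training indices never sees data from its own test period or later,
    which is the information-set discipline the framework enforces.
    """
    start = 0
    while start + train_window + test_window <= n_periods:
        yield (np.arange(start, start + train_window),
               np.arange(start + train_window,
                         start + train_window + test_window))
        start += test_window  # roll forward by one test window

def sharpe_and_max_drawdown(returns, periods_per_year=4):
    """Annualized Sharpe ratio and maximum drawdown of a return series."""
    r = np.asarray(returns, dtype=float)
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    wealth = np.cumprod(1.0 + r)                       # cumulative wealth curve
    drawdown = wealth / np.maximum.accumulate(wealth) - 1.0
    return sharpe, drawdown.min()

# 40 quarters with an assumed 6-quarter training window give 34 test
# periods, matching the paper's count (the true window lengths may differ).
splits = list(walk_forward_splits(n_periods=40, train_window=6))
```

Each `(train, test)` pair can then drive one fit-and-evaluate cycle, with per-period test returns concatenated before computing aggregate statistics, so no parameter is ever chosen using data from its own evaluation window.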
Related papers
- The GT-Score: A Robust Objective Function for Reducing Overfitting in Data-Driven Trading Strategies [51.56484100374058]
GT-Score is a composite objective function that integrates performance, statistical significance, consistency, and downside risk. In walk-forward validation, GT-Score improves the generalization ratio by 98% relative to baseline objective functions. These results suggest that embedding an anti-overfitting structure into the objective can improve the reliability of backtests in quantitative research.
arXiv Detail & Related papers (2026-01-22T05:16:47Z) - Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents [0.7699235580548228]
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
arXiv Detail & Related papers (2026-01-17T19:47:55Z) - All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection [67.89888669159899]
RFC Bench is a benchmark for evaluating large language models on financial misinformation under realistic news conditions. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis.
arXiv Detail & Related papers (2026-01-07T18:18:28Z) - Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving [65.02106674311908]
This paper introduces Intern-S1-MO, a long-horizon math agent that conducts multi-round hierarchical reasoning. By maintaining a compact memory in the form of lemmas, Intern-S1-MO can more freely explore the lemma-rich reasoning spaces. Experiments show that Intern-S1-MO can obtain 26 out of 35 points on the non-geometry problems of IMO 2025, matching the performance of silver medalists.
arXiv Detail & Related papers (2025-12-11T15:26:28Z) - Inferring Latent Market Forces: Evaluating LLM Detection of Gamma Exposure Patterns via Obfuscation Testing [0.0937899315060426]
Tests three dealer hedging constraint patterns on 242 trading days (95.6% coverage) of S&P 500 options data. We find that LLMs achieve a 71.5% detection rate using unbiased prompts that provide only raw gamma exposure values.
arXiv Detail & Related papers (2025-12-08T15:48:57Z) - Bayesian Modeling for Uncertainty Management in Financial Risk Forecasting and Compliance [0.0]
We develop an integrated approach that consistently enhances the handling of risk in market volatility forecasting, fraud detection, and compliance monitoring. We evaluate the performance of one-day-ahead 95% Value-at-Risk (VaR) forecasts on daily S&P 500 returns, with a training period from 2000 to 2019 and an out-of-sample test period spanning 2020 to 2024. Our proposed discount-factor DLM model produces a slightly liberal VaR estimate, with evidence of clustered violations.
arXiv Detail & Related papers (2025-12-06T23:00:19Z) - FinCARE: Financial Causal Analysis with Reasoning and Evidence [39.146761527401424]
Portfolio managers rely on correlation-based analysis and methods that fail to capture the true causal relationships driving performance. We present a hybrid framework that integrates statistical causal discovery algorithms with domain knowledge from two complementary sources: a financial knowledge graph extracted from SEC 10-K filings and large language model reasoning.
arXiv Detail & Related papers (2025-10-23T05:14:28Z) - Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation [7.3923284353934875]
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
arXiv Detail & Related papers (2025-10-15T16:55:56Z) - Formal Models and Convergence Analysis for Context-Aware Security Verification [0.0]
We present a formal framework for context-aware security verification that establishes provable guarantees for ML-enhanced adaptive systems. We introduce context-completeness, a new security property, and prove: (1) sample complexity bounds showing when adaptive verification succeeds, (2) information-theoretic limits relating context richness to detection capability, and (3) convergence guarantees for ML-based payload generators.
arXiv Detail & Related papers (2025-10-14T12:21:36Z) - Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation [0.0]
We generate high-fidelity synthetic bond yield data for four major bond categories (AAA, BAA, US10Y,). We employ a fine-tuned Large Language Model (LLM), Qwen2.5-7B, that generates trading signals, risk assessments, and volatility projections. The reinforcement learning-enhanced synthetic data generation achieves the lowest Mean Absolute Error of 0.103, demonstrating its effectiveness in replicating real-world bond market dynamics.
arXiv Detail & Related papers (2025-02-24T09:46:37Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.