Generative AI for End-to-End Limit Order Book Modelling: A Token-Level
Autoregressive Generative Model of Message Flow Using a Deep State Space
Network
- URL: http://arxiv.org/abs/2309.00638v1
- Date: Wed, 23 Aug 2023 09:37:22 GMT
- Title: Generative AI for End-to-End Limit Order Book Modelling: A Token-Level
Autoregressive Generative Model of Message Flow Using a Deep State Space
Network
- Authors: Peer Nagy, Sascha Frey, Silvia Sapora, Kang Li, Anisoara Calinescu,
Stefan Zohren, Jakob Foerster
- Abstract summary: We propose an end-to-end autoregressive generative model that generates tokenized limit order book (LOB) messages.
Using NASDAQ equity LOBs, we develop a custom tokenizer for message data, converting groups of successive digits to tokens.
Results show promising performance in approximating the data distribution, as evidenced by low model perplexity.
- Score: 7.54290390842336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing a generative model of realistic order flow in financial markets is
a challenging open problem, with numerous applications for market participants.
Addressing this, we propose the first end-to-end autoregressive generative
model that generates tokenized limit order book (LOB) messages. These messages
are interpreted by a Jax-LOB simulator, which updates the LOB state. To handle
long sequences efficiently, the model employs simplified structured state-space
layers to process sequences of order book states and tokenized messages. Using
LOBSTER data of NASDAQ equity LOBs, we develop a custom tokenizer for message
data, converting groups of successive digits to tokens, similar to tokenization
in large language models. Out-of-sample results show promising performance in
approximating the data distribution, as evidenced by low model perplexity.
Furthermore, the mid-price returns calculated from the generated order flow
exhibit a significant correlation with the data, indicating impressive
conditional forecast performance. Due to the granularity of the generated data
and the accuracy of the model, our approach offers new application areas for
future work beyond forecasting, e.g. acting as a world model in high-frequency financial
reinforcement learning applications. Overall, our results invite the use and
extension of the model in the direction of autoregressive large financial
models for the generation of high-frequency financial data and we commit to
open-sourcing our code to facilitate future research.
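
To make the long-sequence processing concrete, the following is a minimal sketch of a diagonal linear state-space recurrence in JAX, in the spirit of the simplified structured state-space layers mentioned above; the parameterisation, the initialisation, and the use of a sequential scan (such layers also admit a faster parallel scan) are illustrative assumptions, not the paper's implementation.

import jax
import jax.numpy as jnp

def ssm_layer(u, a, b, c):
    # x_t = a * x_{t-1} + b * u_t ;  y_t = Re(c * x_t), elementwise over
    # H independent complex state channels (i.e. a diagonal state matrix).
    def step(x, u_t):
        x = a * x + b * u_t
        return x, jnp.real(c * x)
    x0 = jnp.zeros(a.shape, dtype=jnp.complex64)
    _, y = jax.lax.scan(step, x0, u)
    return y  # (T, H) real outputs

# Toy usage: decay magnitudes |a| < 1 keep the recurrence stable.
T, H = 128, 16
u = jax.random.normal(jax.random.PRNGKey(0), (T, H))
a = 0.9 * jnp.exp(1j * jnp.linspace(0.0, 1.0, H))
b = jnp.ones(H, dtype=jnp.complex64)
c = jnp.ones(H, dtype=jnp.complex64)
y = ssm_layer(u, a, b, c)

Because the recurrence is linear in the state, it can be evaluated with an associative (parallel) scan in logarithmic depth, which is what makes such layers attractive for long message streams.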
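
The digit-grouping tokenization can likewise be illustrated in a few lines; the two-digit groups, field widths, and special tokens below are hypothetical choices for illustration, not the paper's actual vocabulary.

def tokenize_field(value: int, width: int, group: int = 2) -> list[str]:
    # Zero-pad a numeric field and split its digits into fixed-size groups,
    # each group becoming one vocabulary token (as in LLM-style subwords).
    digits = str(value).zfill(width)
    return [digits[i:i + group] for i in range(0, width, group)]

def tokenize_message(event_type: int, side: int, price: int, size: int) -> list[str]:
    # Flatten one order book message into a token sequence.
    tokens = [f"TYPE_{event_type}", f"SIDE_{side}"]
    tokens += tokenize_field(price, width=8)  # price in integer ticks
    tokens += tokenize_field(size, width=6)   # order size in shares
    return tokens

# Example: a new buy limit order at 1234500 ticks for 300 shares.
print(tokenize_message(event_type=1, side=1, price=1234500, size=300))
# ['TYPE_1', 'SIDE_1', '01', '23', '45', '00', '00', '03', '00']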
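
The two headline evaluations, out-of-sample perplexity and the correlation between generated and real mid-price returns, reduce to a few lines given per-token log-probabilities and aligned mid-price series; the natural-log convention and the Pearson correlation below are assumptions about the exact setup.

import numpy as np

def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood per token (natural log).
    return float(np.exp(-np.mean(token_log_probs)))

def midprice_return_correlation(real_mid, gen_mid):
    # Pearson correlation between real and generated log mid-price returns.
    r_real = np.diff(np.log(real_mid))
    r_gen = np.diff(np.log(gen_mid))
    return float(np.corrcoef(r_real, r_gen)[0, 1])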
Related papers
- LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data [7.317765812144531]
We present a benchmark designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB).
Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data.
The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times (a sketch of such statistics follows this list).
arXiv Detail & Related papers (2025-02-13T10:56:58Z)
- Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance [0.0]
This paper aims to contribute to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management.
We highlight the inseparable nature of model development and the desired use case by touching on a paradox: generic generative models inherently care less about what is important for constructing portfolios.
arXiv Detail & Related papers (2025-01-07T18:50:24Z)
- Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data [28.34587057844627]
TKGMLP is a hybrid network for tabular data that combines shallow Kolmogorov-Arnold Networks (KAN) with a Gated Multilayer Perceptron (gMLP).
We validate TKGMLP on a real-world credit scoring dataset, where it achieves state-of-the-art results and outperforms current benchmarks.
We propose a novel feature encoding method for numerical data, specifically designed to address the predominance of numerical features in financial datasets.
arXiv Detail & Related papers (2024-12-03T02:38:07Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome the efficiency challenges of iterative refinement.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data [65.6499834212641]
We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm.
By considering domain similarities through task-specific metadata, our model improves generalization, with the excess risk decreasing as the number of training tasks increases.
Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
arXiv Detail & Related papers (2024-06-23T21:28:50Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In tandem with TokenUnify, we have assembled a large-scale, ultra-high-resolution electron microscopy (EM) image dataset.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while achieving a significant reduction in parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences [0.0]
We present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions.
We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions.
arXiv Detail & Related papers (2024-01-03T09:32:48Z)
- The LOB Recreation Model: Predicting the Limit Order Book from TAQ History Using an Ordinary Differential Equation Recurrent Neural Network [9.686252465354274]
We present the LOB recreation model, a first attempt from a deep learning perspective to recreate the top five price levels of the public limit order book (LOB) for small-tick stocks.
Through transfer learning, the source model trained on one stock can be fine-tuned for application to other financial assets of the same class.
arXiv Detail & Related papers (2021-03-02T12:07:43Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
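
As referenced in the LOB-Bench entry above, the following sketch shows the kind of summary statistics such a benchmark compares between real and generated data; the array layout and the use of the 1-Wasserstein distance are illustrative assumptions, not LOB-Bench's actual API.

import numpy as np
from scipy.stats import wasserstein_distance

def lob_summary_stats(best_bid, best_ask, bid_vol, ask_vol, timestamps):
    # Per-message spread, top-of-book volume imbalance, and
    # message inter-arrival times, each as a 1-D array.
    spread = best_ask - best_bid
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol)
    inter_arrival = np.diff(timestamps)
    return spread, imbalance, inter_arrival

def distributional_gap(real_stat, gen_stat):
    # One scalar distributional difference per statistic; a benchmark
    # would aggregate such gaps over conditional and unconditional statistics.
    return wasserstein_distance(real_stat, gen_stat)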