Generative AI for End-to-End Limit Order Book Modelling: A Token-Level
Autoregressive Generative Model of Message Flow Using a Deep State Space
Network
- URL: http://arxiv.org/abs/2309.00638v1
- Date: Wed, 23 Aug 2023 09:37:22 GMT
- Title: Generative AI for End-to-End Limit Order Book Modelling: A Token-Level
Autoregressive Generative Model of Message Flow Using a Deep State Space
Network
- Authors: Peer Nagy, Sascha Frey, Silvia Sapora, Kang Li, Anisoara Calinescu,
Stefan Zohren, Jakob Foerster
- Abstract summary: We propose an end-to-end autoregressive generative model that generates tokenized limit order book (LOB) messages.
Using NASDAQ equity LOBs, we develop a custom tokenizer for message data, converting groups of successive digits to tokens.
Results show promising performance in approximating the data distribution, as evidenced by low model perplexity.
- Score: 7.54290390842336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing a generative model of realistic order flow in financial markets is
a challenging open problem, with numerous applications for market participants.
Addressing this, we propose the first end-to-end autoregressive generative
model that generates tokenized limit order book (LOB) messages. These messages
are interpreted by a Jax-LOB simulator, which updates the LOB state. To handle
long sequences efficiently, the model employs simplified structured state-space
layers to process sequences of order book states and tokenized messages. Using
LOBSTER data of NASDAQ equity LOBs, we develop a custom tokenizer for message
data, converting groups of successive digits to tokens, similar to tokenization
in large language models. Out-of-sample results show promising performance in
approximating the data distribution, as evidenced by low model perplexity.
Furthermore, the mid-price returns calculated from the generated order flow
exhibit a significant correlation with the data, indicating impressive
conditional forecast performance. Due to the granularity of the generated data
and the accuracy of the model, our approach offers new application areas for
future work beyond forecasting, e.g. acting as a world model in high-frequency financial
reinforcement learning applications. Overall, our results invite the use and
extension of the model in the direction of autoregressive large financial
models for the generation of high-frequency financial data and we commit to
open-sourcing our code to facilitate future research.
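
To make the long-sequence processing concrete, the following is a minimal sketch of a diagonal linear state-space recurrence in JAX, in the spirit of the simplified structured state-space layers mentioned above; the parameterisation, the initialisation, and the use of a sequential scan (such layers also admit a faster parallel scan) are illustrative assumptions, not the paper's implementation.

import jax
import jax.numpy as jnp

def ssm_layer(u, a, b, c):
    # x_t = a * x_{t-1} + b * u_t ;  y_t = Re(c * x_t), elementwise over
    # H independent complex state channels (i.e. a diagonal state matrix).
    def step(x, u_t):
        x = a * x + b * u_t
        return x, jnp.real(c * x)
    x0 = jnp.zeros(a.shape, dtype=jnp.complex64)
    _, y = jax.lax.scan(step, x0, u)
    return y  # (T, H) real outputs

# Toy usage: decay magnitudes |a| < 1 keep the recurrence stable.
T, H = 128, 16
u = jax.random.normal(jax.random.PRNGKey(0), (T, H))
a = 0.9 * jnp.exp(1j * jnp.linspace(0.0, 1.0, H))
b = jnp.ones(H, dtype=jnp.complex64)
c = jnp.ones(H, dtype=jnp.complex64)
y = ssm_layer(u, a, b, c)

Because the recurrence is linear in the state, it can be evaluated with an associative (parallel) scan in logarithmic depth, which is what makes such layers attractive for long message streams.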
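
The digit-grouping tokenization can likewise be illustrated in a few lines; the two-digit groups, field widths, and special tokens below are hypothetical choices for illustration, not the paper's actual vocabulary.

def tokenize_field(value: int, width: int, group: int = 2) -> list[str]:
    # Zero-pad a numeric field and split its digits into fixed-size groups,
    # each group becoming one vocabulary token (as in LLM-style subwords).
    digits = str(value).zfill(width)
    return [digits[i:i + group] for i in range(0, width, group)]

def tokenize_message(event_type: int, side: int, price: int, size: int) -> list[str]:
    # Flatten one order book message into a token sequence.
    tokens = [f"TYPE_{event_type}", f"SIDE_{side}"]
    tokens += tokenize_field(price, width=8)  # price in integer ticks
    tokens += tokenize_field(size, width=6)   # order size in shares
    return tokens

# Example: a new buy limit order at 1234500 ticks for 300 shares.
print(tokenize_message(event_type=1, side=1, price=1234500, size=300))
# ['TYPE_1', 'SIDE_1', '01', '23', '45', '00', '00', '03', '00']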
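
The two headline evaluations, out-of-sample perplexity and the correlation between generated and real mid-price returns, reduce to a few lines given per-token log-probabilities and aligned mid-price series; the natural-log convention and the Pearson correlation below are assumptions about the exact setup.

import numpy as np

def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood per token (natural log).
    return float(np.exp(-np.mean(token_log_probs)))

def midprice_return_correlation(real_mid, gen_mid):
    # Pearson correlation between real and generated log mid-price returns.
    r_real = np.diff(np.log(real_mid))
    r_gen = np.diff(np.log(gen_mid))
    return float(np.corrcoef(r_real, r_gen)[0, 1])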
Related papers
- LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data [7.317765812144531]
We present a benchmark designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB).
Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data.
The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times (a sketch of such statistics follows this list).
arXiv Detail & Related papers (2025-02-13T10:56:58Z)
- Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance [0.0]
This paper aims to contribute to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management.
We highlight the inseparable nature of model development and the desired use case by touching on a paradox: generic generative models inherently care less about what is important for constructing portfolios.
arXiv Detail & Related papers (2025-01-07T18:50:24Z)
- Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data [28.34587057844627]
TKGMLP is a hybrid network for tabular data that combines shallow Kolmogorov-Arnold Networks (KAN) with a Gated Multilayer Perceptron (gMLP).
We validate TKGMLP on a real-world credit scoring dataset, where it achieves state-of-the-art results and outperforms current benchmarks.
We propose a novel feature encoding method for numerical data, specifically designed to address the predominance of numerical features in financial datasets.
arXiv Detail & Related papers (2024-12-03T02:38:07Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome the efficiency challenges of iterative refinement.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data [65.6499834212641]
We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm.
By considering domain similarities through task-specific metadata, our model improves generalization, with the excess risk decreasing as the number of training tasks increases.
Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
arXiv Detail & Related papers (2024-06-23T21:28:50Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In tandem with TokenUnify, we have assembled a large-scale, ultra-high-resolution electron microscopy (EM) image dataset.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while achieving a significant reduction in parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences [0.0]
We present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions.
We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions.
arXiv Detail & Related papers (2024-01-03T09:32:48Z)
- The LOB Recreation Model: Predicting the Limit Order Book from TAQ History Using an Ordinary Differential Equation Recurrent Neural Network [9.686252465354274]
We present the LOB recreation model, a first attempt from a deep learning perspective to recreate the top five price levels of the public limit order book (LOB) for small-tick stocks.
Through transfer learning, the source model trained on one stock can be fine-tuned for application to other financial assets of the same class.
arXiv Detail & Related papers (2021-03-02T12:07:43Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
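
As referenced in the LOB-Bench entry above, the following sketch shows the kind of summary statistics such a benchmark compares between real and generated data; the array layout and the use of the 1-Wasserstein distance are illustrative assumptions, not LOB-Bench's actual API.

import numpy as np
from scipy.stats import wasserstein_distance

def lob_summary_stats(best_bid, best_ask, bid_vol, ask_vol, timestamps):
    # Per-message spread, top-of-book volume imbalance, and
    # message inter-arrival times, each as a 1-D array.
    spread = best_ask - best_bid
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol)
    inter_arrival = np.diff(timestamps)
    return spread, imbalance, inter_arrival

def distributional_gap(real_stat, gen_stat):
    # One scalar distributional difference per statistic; a benchmark
    # would aggregate such gaps over conditional and unconditional statistics.
    return wasserstein_distance(real_stat, gen_stat)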