ByteGen: A Tokenizer-Free Generative Model for Orderbook Events in Byte Space
- URL: http://arxiv.org/abs/2508.02247v2
- Date: Thu, 07 Aug 2025 04:31:56 GMT
- Title: ByteGen: A Tokenizer-Free Generative Model for Orderbook Events in Byte Space
- Authors: Yang Li, Zhi Chen
- Abstract summary: We introduce ByteGen, a novel generative model that operates directly on the raw byte streams of LOB events. The core novelty of our work is the complete elimination of feature engineering and tokenization, enabling the model to learn market dynamics from its most fundamental representation. ByteGen successfully reproduces key stylized facts of financial markets, generating realistic price distributions, heavy-tailed returns, and bursty event timing.
- Score: 11.523583937607622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative modeling of high-frequency limit order book (LOB) dynamics is a critical yet unsolved challenge in quantitative finance, essential for robust market simulation and strategy backtesting. Existing approaches are often constrained by simplifying stochastic assumptions or, in the case of modern deep learning models like Transformers, rely on tokenization schemes that affect the high-precision, numerical nature of financial data through discretization and binning. To address these limitations, we introduce ByteGen, a novel generative model that operates directly on the raw byte streams of LOB events. Our approach treats the problem as an autoregressive next-byte prediction task, for which we design a compact and efficient 32-byte packed binary format to represent market messages without information loss. The core novelty of our work is the complete elimination of feature engineering and tokenization, enabling the model to learn market dynamics from its most fundamental representation. We achieve this by adapting the H-Net architecture, a hybrid Mamba-Transformer model that uses a dynamic chunking mechanism to discover the inherent structure of market messages without predefined rules. Our primary contributions are: 1) the first end-to-end, byte-level framework for LOB modeling; 2) an efficient packed data representation; and 3) a comprehensive evaluation on high-frequency data. Trained on over 34 million events from CME Bitcoin futures, ByteGen successfully reproduces key stylized facts of financial markets, generating realistic price distributions, heavy-tailed returns, and bursty event timing. Our findings demonstrate that learning directly from byte space is a promising and highly flexible paradigm for modeling complex financial systems, achieving competitive performance on standard market quality metrics without the biases of tokenization.
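The abstract does not spell out the 32-byte field layout, so the following is only a minimal sketch of what such a lossless, fixed-width packed message format could look like; the field names, widths, and ordering are assumptions for illustration, not the paper's actual schema.

```python
import struct

# Hypothetical fixed-width 32-byte LOB event record (assumed layout, not the paper's schema):
#   uint64 timestamp_ns | uint8 event_type | uint8 side | uint16 reserved
#   int64  price_ticks  | uint64 size      | uint32 order_id
EVENT_STRUCT = struct.Struct("<QBBHqQI")  # little-endian, 32 bytes total
assert EVENT_STRUCT.size == 32

def pack_event(ts_ns, event_type, side, price_ticks, size, order_id):
    """Serialize one market message into a fixed 32-byte record, with no binning or rounding."""
    return EVENT_STRUCT.pack(ts_ns, event_type, side, 0, price_ticks, size, order_id)

def to_byte_stream(events):
    """Concatenate packed records; an autoregressive model would predict this stream byte by byte."""
    return b"".join(pack_event(*e) for e in events)

# Example: one hypothetical "new bid order" event.
stream = to_byte_stream([(1_700_000_000_000_000_000, 1, 0, 128_643, 3, 42)])
print(len(stream))  # 32 bytes per event
```

Because every field keeps its full binary width, prices and sizes are preserved exactly, which is the property the abstract contrasts with the discretization and binning introduced by tokenizer-based models.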
Related papers
- Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach [55.861432910722186]
UniToCom is a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission. We propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information. We employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens.
arXiv Detail & Related papers (2025-07-02T14:03:01Z)
- FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
- TRADES: Generating Realistic Market Simulations with Diffusion Models [4.308104021015939]
Financial markets are complex systems characterized by high statistical noise, nonlinearity, and constant evolution. We address the task of generating realistic and responsive Limit Order Book (LOB) market simulations. We propose a novel Denoising Diffusion Probabilistic Engine for LOB Simulations (TRADES).
arXiv Detail & Related papers (2025-01-31T19:43:13Z)
- STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading [55.02735046724146]
In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. We propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM. STORM extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings.
arXiv Detail & Related papers (2024-12-12T17:15:49Z)
- MCI-GRU: Stock Prediction Model Based on Multi-Head Cross-Attention and Improved GRU [15.232546605091818]
This paper proposes a stock prediction model, MCI-GRU, based on a multi-head cross-attention mechanism and an improved GRU. Experiments on four major stock markets show that the proposed method outperforms SOTA techniques across multiple metrics.
arXiv Detail & Related papers (2024-09-25T14:37:49Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Generative AI for End-to-End Limit Order Book Modelling: A Token-Level Autoregressive Generative Model of Message Flow Using a Deep State Space Network [7.54290390842336]
We propose an end-to-end autoregressive generative model that generates tokenized limit order book (LOB) messages.
Using NASDAQ equity LOBs, we develop a custom tokenizer for message data, converting groups of successive digits to tokens.
Results show promising performance in approximating the data distribution, as evidenced by low model perplexity.
arXiv Detail & Related papers (2023-08-23T09:37:22Z)
- Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z)
- Neural Stochastic Agent-Based Limit Order Book Simulation: A Hybrid Methodology [6.09170287691728]
Modern financial exchanges use an electronic limit order book (LOB) to store bid and ask orders for a specific financial asset.
We propose a novel hybrid LOB simulation paradigm characterised by: (1) representing the aggregation of market events' logic by a neural background trader that is pre-trained on historical LOB data through a neural point model; and (2) embedding the background trader in a multi-agent simulation with other trading agents.
We show that the stylised facts remain and we demonstrate order flow impact and financial herding behaviours that are in accordance with empirical observations of real markets.
arXiv Detail & Related papers (2023-02-28T20:53:39Z)
- Transfer Ranking in Finance: Applications to Cross-Sectional Momentum with Data Scarcity [2.3204178451683264]
We introduce Fused Networks -- a novel and hybrid parameter-sharing transfer ranking model.
The model fuses information extracted using an encoder-attention module operated on a source dataset.
This mitigates the poor generalisability that results from training models on scarce target data.
arXiv Detail & Related papers (2022-08-21T21:34:11Z)
- Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets [84.90242084523565]
Traditional time-series econometric methods often appear incapable of capturing the true complexity of the multi-level interactions driving the price dynamics.
By adopting a state-of-the-art second-order optimization algorithm, we train a Bayesian bilinear neural network with temporal attention.
Using predictive distributions to analyze the errors and uncertainties associated with the estimated parameters and model forecasts, we thoroughly compare our Bayesian model with traditional ML alternatives.
arXiv Detail & Related papers (2022-03-07T18:59:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.