Innovative tokenisation of structured data for LLM training
- URL: http://arxiv.org/abs/2508.01685v1
- Date: Sun, 03 Aug 2025 09:29:50 GMT
- Title: Innovative tokenisation of structured data for LLM training
- Authors: Kayvan Karim, Hani Ragab Hassen, Hadj Batatia
- Abstract summary: This paper introduces a novel, hybrid tokenisation methodology to convert structured data into a sequential format suitable for training Large Language Models (LLMs). We show that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data representation remains a fundamental challenge in machine learning, particularly when adapting sequence-based architectures like Transformers and Large Language Models (LLMs) for structured tabular data. Existing methods often fail to cohesively encode the mix of numerical and categorical features or preserve the inherent structure of tables. This paper introduces a novel, hybrid tokenisation methodology designed to convert tabular data into a unified, sequential format suitable for LLM training. Our approach combines predefined fixed tokens to represent structural elements and low-cardinality categorical features, with a learned subword vocabulary using Byte-Pair Encoding (BPE) for high-cardinality and continuous values. We demonstrate the efficacy of this technique by applying it to a large-scale NetFlow dataset (CIDDS-001), preparing a corpus for a Network Intrusion Detection System (NIDS) foundation model. The evaluation shows that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.
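The paper itself does not include code, but the hybrid scheme it describes can be sketched compactly: predefined fixed tokens for structure and low-cardinality categorical fields, and a learned BPE vocabulary for everything high-cardinality or continuous. Below is a minimal sketch assuming the HuggingFace tokenizers library; the field names, special tokens, and toy training corpus are invented for illustration and are not taken from the paper.

```python
# Hybrid tokenisation sketch: fixed tokens for structure and low-cardinality
# fields, learned BPE subwords for high-cardinality/continuous values.
# All token names and fields below are illustrative assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Low-cardinality categorical features map to predefined fixed tokens.
PROTO_TOKENS = {"TCP": "<PROTO_TCP>", "UDP": "<PROTO_UDP>", "ICMP": "<PROTO_ICMP>"}

# BPE covers values with too many distinct forms for fixed tokens
# (IP addresses, ports, byte counts); trained here on a toy corpus.
bpe = Tokenizer(models.BPE(unk_token="<UNK>"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train_from_iterator(["192.168.100.5 443 1024", "10.0.0.7 80 5500"],
                        trainer=trainers.BpeTrainer(vocab_size=256,
                                                    special_tokens=["<UNK>"]))

def tokenise_flow(flow: dict) -> list[str]:
    """Serialise one NetFlow record into a hybrid token sequence."""
    tokens = ["<FLOW>", PROTO_TOKENS[flow["proto"]]]
    for field in ("src_ip", "dst_port", "bytes"):
        tokens.append("<SEP>")
        tokens.extend(bpe.encode(str(flow[field])).tokens)
    tokens.append("</FLOW>")
    return tokens

print(tokenise_flow({"proto": "TCP", "src_ip": "192.168.100.5",
                     "dst_port": 443, "bytes": 1024}))
```

The design point is that structural markers and small categorical vocabularies never fragment into subwords, while the shared BPE vocabulary compresses the long tail of values, which is plausibly what drives the reported 6.18:1 compression ratio.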
Related papers
- Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling [17.277753030570263]
We introduce techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision.
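The summary does not say how boundaries are chosen; one plausible reading, sketched below with plain numpy, is to place a chunk boundary wherever adjacent token representations diverge, so segmentation tracks content rather than fixed rules. The similarity heuristic and threshold are invented stand-ins for the model's learned routing module.

```python
import numpy as np

def dynamic_chunks(embeddings: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Return chunk-start indices where adjacent representations diverge.

    A crude stand-in for a learned routing module: the real model would
    learn the projections and decide boundaries end-to-end.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    cos = (x[1:] * x[:-1]).sum(axis=-1)      # similarity of neighbours
    boundary_prob = (1.0 - cos) / 2.0        # dissimilar -> likely boundary
    return [0] + [i + 1 for i, p in enumerate(boundary_prob) if p > threshold]

# Toy usage: 10 character-level states of width 8, chunked by content shifts.
states = np.random.randn(10, 8)
print(dynamic_chunks(states))
```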
arXiv Detail & Related papers (2025-07-10T17:39:37Z)
- Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data. We introduce a novel principled method for DP synthetic image embedding generation. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
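The snippet gives no algorithmic detail, so the sketch below shows only the generic recipe behind clustering-based DP synthesis: cluster the real embeddings, perturb the centroids with Gaussian noise, and sample around them. The noise scale is illustrative and not calibrated to any (epsilon, delta) guarantee; the paper's actual mechanism may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def dp_synthetic_embeddings(real: np.ndarray, k: int, sigma: float,
                            n_samples: int) -> np.ndarray:
    """Cluster real embeddings, privatise centroids with Gaussian noise,
    then sample synthetic points around them. Illustrative only: a real
    DP guarantee needs sigma calibrated to sensitivity and epsilon."""
    rng = np.random.default_rng(0)
    centroids = KMeans(n_clusters=k, n_init=10).fit(real).cluster_centers_
    noisy = centroids + rng.normal(0.0, sigma, centroids.shape)
    picks = rng.integers(0, k, n_samples)
    return noisy[picks] + rng.normal(0.0, sigma, (n_samples, real.shape[1]))
```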
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
- The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats [0.0]
This study systematically evaluates Large Language Models' ability to convert unstructured text into structured formats. Experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance. These findings open new possibilities for automated structured data generation across various domains.
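The snippet names few-shot prompting but not the setup; a minimal text-to-JSON version might look like the sketch below. call_llm is a hypothetical stand-in for whatever chat-completion client is in use, and the target schema is invented.

```python
import json

FEW_SHOT = """Convert each record to JSON with keys name, amount, date.

Text: Paid Alice 40 euros on March 3rd, 2025.
JSON: {"name": "Alice", "amount": 40, "date": "2025-03-03"}

Text: {text}
JSON:"""

def extract(text: str, call_llm) -> dict:
    """Few-shot structured extraction; call_llm is any prompt -> str function."""
    raw = call_llm(FEW_SHOT.replace("{text}", text))
    return json.loads(raw)  # fails loudly if the model drifts off-schema
```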
arXiv Detail & Related papers (2025-03-04T14:14:28Z)
- Prior-Fitted Networks Scale to Larger Datasets When Treated as Weak Learners [82.72552644267724]
BoostPFN can outperform standard PFNs given the same number of training samples on large datasets. High performance is maintained for up to 50x the pre-training size of PFNs.
arXiv Detail & Related papers (2025-03-03T07:31:40Z)
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z)
- Distributionally robust self-supervised learning for tabular data [2.942619386779508]
Learning robust representations in the presence of error slices is challenging, due to high-cardinality features and the complexity of constructing error sets. Traditional robust representation learning methods are largely focused on improving worst-group performance in supervised settings in computer vision. Our approach utilizes an encoder-decoder model trained with a Masked Language Modeling (MLM) loss to learn robust latent representations.
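The snippet names the MLM objective but not the corruption step; the sketch below shows MLM-style masking over a serialised tabular row. The "col=value" serialisation is an illustrative assumption, not the paper's format.

```python
import random

def mask_row(row: dict, mask_rate: float = 0.15, seed: int = 0):
    """Serialise a row to 'col=value' tokens and mask a fraction, yielding
    (corrupted input, reconstruction targets) for MLM-style training."""
    rng = random.Random(seed)
    tokens = [f"{col}={val}" for col, val in row.items()]
    targets = {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]   # remember the original for the loss
            tokens[i] = "[MASK]"
    return tokens, targets

print(mask_row({"proto": "TCP", "port": 443, "bytes": 1024}, mask_rate=0.5))
```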
arXiv Detail & Related papers (2024-10-11T04:23:56Z)
- ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
- EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
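EPIC's exact prompt design is not given in the snippet; the sketch below illustrates only the class-balancing idea it describes: seed the prompt with an equal number of examples per class so minority classes are equally represented. Formatting details are invented.

```python
from collections import defaultdict

def balanced_prompt(examples: list[tuple[str, str]], per_class: int) -> str:
    """Group labelled rows by class and take the same number from each,
    so minority classes are equally represented in the prompt."""
    groups = defaultdict(list)
    for row, label in examples:
        groups[label].append(row)
    lines = [f"{label}: {row}"
             for label, rows in sorted(groups.items())
             for row in rows[:per_class]]
    return "Generate new rows in the same format:\n" + "\n".join(lines)
```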
arXiv Detail & Related papers (2024-04-15T17:49:16Z)
- On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL [8.57550491437633]
This work investigates the linearized handling of structured data in encoder-decoder language models, specifically T5.
Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction.
We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings.
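Linearisation itself is easy to illustrate; the sketch below follows a common convention for flattening a question and table schema into a single encoder input string. The exact delimiters used in the paper may differ.

```python
def linearize(question: str, schema: dict[str, list[str]]) -> str:
    """Flatten question + schema into a single T5-style input string."""
    tables = " | ".join(f"{t}: {', '.join(cols)}" for t, cols in schema.items())
    return f"question: {question} | schema: {tables}"

print(linearize("How many flows used TCP?", {"flows": ["id", "proto", "bytes"]}))
```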
arXiv Detail & Related papers (2024-04-03T01:16:20Z)
- DictLLM: Harnessing Key-Value Data Structures with Large Language Models for Enhanced Medical Diagnostics [36.057925881268226]
DictLLM is an innovative framework designed to improve the modeling of key-value structured data, like medical laboratory reports, for generating medical diagnoses.
We carry out experiments using various LLM models on a comprehensive real-world medical laboratory report dataset for automatic diagnosis generation.
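The snippet does not describe DictLLM's architecture; for orientation, the sketch below shows the naive flat-text serialisation of a key-value report that such a framework aims to improve upon. The report fields are invented.

```python
def flatten_report(report: dict[str, str]) -> str:
    """Naive key-value serialisation of a lab report for an LLM prompt;
    a baseline that discards the structure DictLLM is built to preserve."""
    body = "\n".join(f"{k}: {v}" for k, v in report.items())
    return f"Laboratory report:\n{body}\nLikely diagnosis:"

print(flatten_report({"WBC": "14.2 x10^9/L", "CRP": "88 mg/L"}))
```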
arXiv Detail & Related papers (2024-02-18T07:10:02Z)
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data [117.13986738340027]
We develop an Iterative Reading-then-Reasoning (IRR) approach for solving question answering tasks based on structured data.
Our approach can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines.
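Generically, an IRR loop alternates a reading step (pull relevant pieces from the structured source) with a reasoning step (ask the LLM whether it can answer yet). In the sketch below, retrieve and call_llm are hypothetical stand-ins, not the paper's interfaces.

```python
def irr_answer(question: str, retrieve, call_llm, max_rounds: int = 3) -> str:
    """Iterative Reading-then-Reasoning: gather evidence, then reason.

    retrieve(question, evidence) -> new evidence string (reading step);
    call_llm(prompt) -> str (reasoning step). Both are stand-ins.
    """
    evidence = []
    for _ in range(max_rounds):
        evidence.append(retrieve(question, evidence))      # read
        reply = call_llm(                                  # reason
            f"Evidence: {' '.join(evidence)}\n"
            f"Question: {question}\n"
            "Answer, or say MORE if evidence is insufficient:")
        if not reply.strip().startswith("MORE"):
            return reply
    return reply
```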
arXiv Detail & Related papers (2023-05-16T17:45:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.