Related papers: Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction

URL: http://arxiv.org/abs/2505.06297v1
Date: Wed, 07 May 2025 17:42:35 GMT
Title: Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction
Authors: Yu Mao, Holger Pirk, Chun Jason Xue
Abstract summary: Large language models (LLMs) continue to be deployed and utilized across domains. Compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. We show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip.
Score: 9.302754209202607
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As large language models (LLMs) continue to be deployed and utilized across domains, the volume of LLM-generated data is growing rapidly. This trend highlights the increasing importance of effective and lossless compression for such data in modern text management systems. However, compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. Traditional machine-generated data is typically derived from computational processes or device outputs, often highly structured and limited to low-level elements like labels or numerical values. This structure enables conventional lossless compressors to perform efficiently. In contrast, LLM-generated data is more complex and diverse, requiring new approaches for effective compression. In this work, we conduct the first systematic investigation of lossless compression techniques tailored specifically to LLM-generated data. Notably, because LLMs are trained via next-token prediction, we find that LLM-generated data is highly predictable for the models themselves. This predictability enables LLMs to serve as efficient compressors of their own outputs. Through extensive experiments with 14 representative LLMs and 8 LLM-generated datasets from diverse domains, we show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip, a widely used general-purpose compressor. Furthermore, this advantage holds across different LLM sizes and dataset types, demonstrating the robustness and practicality of LLM-based methods in lossless text compression under generative AI workloads.
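The link between predictability and compression described in the abstract can be illustrated with a small sketch. It is not the paper's method: a tiny adaptive character-level model stands in for the LLM, and the function names (`gzip_ratio`, `ideal_code_length_bits`, `predictive_ratio`) are hypothetical. The sketch sums -log2 p(next symbol | context), which is the size an arithmetic coder driven by the same predictions would approach, and compares the resulting ratio with Gzip's.

```python
import gzip
import math
from collections import defaultdict


def gzip_ratio(text: str) -> float:
    """Raw UTF-8 size divided by gzip-compressed size."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


def ideal_code_length_bits(text: str, order: int = 3) -> float:
    """Total of -log2 p(byte | previous `order` bytes) over the text.

    Assumption: an adaptive byte-level model with add-one smoothing over a
    256-symbol alphabet stands in for an LLM's next-token distribution.
    """
    data = text.encode("utf-8")
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]
        # Predict first, then update, so a decoder could mirror each step.
        p = (counts[ctx][sym] + 1) / (totals[ctx] + 256)
        bits -= math.log2(p)
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


def predictive_ratio(text: str, order: int = 3) -> float:
    """Raw size in bits divided by the ideal code length (non-empty text only)."""
    raw_bits = 8 * len(text.encode("utf-8"))
    return raw_bits / ideal_code_length_bits(text, order)


if __name__ == "__main__":
    sample = "The model predicts the next token. " * 200
    print(f"gzip ratio:       {gzip_ratio(sample):.1f}x")
    print(f"predictive ratio: {predictive_ratio(sample):.1f}x")
```

The better the model predicts the next symbol, the fewer bits each symbol costs. That is the sense in which an LLM can serve as a compressor of text it finds highly predictable. The numbers this toy model prints say nothing about the 20x and 3x figures reported in the abstract.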

Related papers

Revisiting Data Compression with Language Modeling [0.0]
We investigate the potential use of large language models (LLMs) in the task of data compression. We achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset. We show that while LLMs excel in compressing data in text-dominant domains, their ability to compress non-natural text sequences remains competitive if configured in the right way.
arXiv Detail & Related papers (2026-01-06T10:03:33Z)
LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report) [4.2414540423650795]
LLMCOMP is a lossy compression paradigm that leverages decoder-only large language models to model scientific data. It consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds.
arXiv Detail & Related papers (2025-10-24T05:41:04Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors [25.086118164540974]
Large language models (LLMs) are used to improve the performance of click-through rate (CTR) prediction. As user sequences grow longer, the current efficiency of LLMs is inadequate for training on billions of users and items. We propose Behavior Aggregated Hierarchical (BAHE) to enhance the efficiency of LLM-based CTR modeling.
arXiv Detail & Related papers (2024-03-28T12:05:15Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models. We propose a soft prompt learning method where we expose the compressed model to the prompt learning process. Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction. Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.