Revisiting Data Compression with Language Modeling
- URL: http://arxiv.org/abs/2601.02875v1
- Date: Tue, 06 Jan 2026 10:03:33 GMT
- Title: Revisiting Data Compression with Language Modeling
- Authors: Chen-Han Tsai
- Abstract summary: We investigate the potential use of large language models (LLMs) for data compression. We achieve a new state-of-the-art (SOTA) adjusted compression rate of around $18\%$ on the enwik9 dataset. We show that while LLMs excel at compressing data in text-dominant domains, their performance on non-natural text sequences remains competitive when configured appropriately.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this report, we investigate the potential use of large language models (LLMs) for data compression. Previous works have demonstrated promising results in applying LLMs to compress not only text but also a wide range of multi-modal data. Despite the favorable performance achieved, several practical questions still stand in the way of replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we achieve a new state-of-the-art (SOTA) adjusted compression rate of around $18\%$ on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte-stream sequences. We show that while LLMs excel at compressing data in text-dominant domains, their performance on non-natural text sequences remains competitive when configured appropriately.
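To make the compression setup concrete, below is a minimal sketch of how an autoregressive LLM can act as a lossless compressor. Rather than emitting an actual bit stream, it sums the ideal arithmetic-coding cost $\sum_t -\log_2 p(x_t \mid x_{<t})$ under the model's next-token distribution and then computes an adjusted compression rate. The "gpt2" checkpoint, the Hugging Face transformers stack, counting the model as fp16 parameter bytes, and not charging the first token are illustrative assumptions, not the configuration used in this report.

```python
# Sketch: LLM-as-compressor, estimated via ideal arithmetic-coding cost.
# Assumptions (not from the paper): gpt2 checkpoint, HF transformers/torch,
# model size counted as fp16 parameter bytes, first token not charged.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative checkpoint; the report's exact model may differ

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def ideal_code_length_bits(text: str) -> float:
    """Sum of -log2 p(x_t | x_<t): the cost an arithmetic coder driven by the
    model's next-token distribution would approximately pay for the sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predicts tokens 2..T
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return float(-token_lp.sum() / math.log(2))             # natural log -> bits


def adjusted_compression_rate(text: str, model_size_bytes: int) -> float:
    """(compressed bytes + model bytes) / raw bytes, following the common
    convention of charging the model's size against the compressed output."""
    raw_bytes = len(text.encode("utf-8"))
    compressed_bytes = ideal_code_length_bits(text) / 8
    return (compressed_bytes + model_size_bytes) / raw_bytes


if __name__ == "__main__":
    sample = "Data compression with language modeling revisits arithmetic coding."
    # fp16 parameter bytes; a real evaluation might count a compressed checkpoint.
    model_bytes = sum(p.numel() for p in model.parameters()) * 2
    raw = len(sample.encode("utf-8"))
    print(f"raw compression rate:      {ideal_code_length_bits(sample) / 8 / raw:.3f}")
    print(f"adjusted compression rate: {adjusted_compression_rate(sample, model_bytes):.3f}")
```

On a corpus the size of enwik9 (roughly 1 GB) the model term is amortized over far more data than in this toy example, which is why an adjusted rate close to the raw rate is achievable at all; on a short string the adjusted rate is dominated by the model size.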
Related papers
- LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report) [4.2414540423650795]
LLMCOMP is a lossy compression paradigm that leverages decoder-only large language models to model scientific data. It consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds.
arXiv Detail & Related papers (2025-10-24T05:41:04Z)
- Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction [9.302754209202607]
Large language models (LLMs) continue to be deployed and utilized across domains. Compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. We show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip.
arXiv Detail & Related papers (2025-05-07T17:42:35Z)
- Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment [84.74716380180428]
We propose AutoRegEmbed, a contrastive learning method built on embedding conditional probability distributions. We show that our method significantly outperforms traditional contrastive learning approaches.
arXiv Detail & Related papers (2025-02-17T03:36:25Z)
- Efficient Long Context Language Model Retrieval with Compression [57.09163579304332]
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR). We propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages. We show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91.
arXiv Detail & Related papers (2024-12-24T07:30:55Z)
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data; a lower compression ratio usually yields a lower training loss.
Based on the findings of the entropy law, we propose an efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text. We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed large language models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' abilities in language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
- Semantic Compression With Large Language Models [1.0874100424278175]
Large language models (LLMs) are revolutionizing information retrieval, question answering, summarization, and code generation tasks.
LLMs are inherently limited by the number of input and output tokens that can be processed at once.
This paper presents three contributions to research on LLMs.
arXiv Detail & Related papers (2023-04-25T01:47:05Z)
- ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-versus-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
arXiv Detail & Related papers (2023-02-07T18:55:28Z)