LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)
- URL: http://arxiv.org/abs/2510.23632v2
- Date: Tue, 04 Nov 2025 20:59:30 GMT
- Title: LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)
- Authors: Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Panos Kalnis
- Abstract summary: LLMCOMP is a lossy compression paradigm that leverages decoder-only large language models to model scientific data. It consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatiotemporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds. Experiments on multiple reanalysis datasets show that LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds. These results highlight the potential of LLMs as general-purpose compressors for high-fidelity scientific data.
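As a rough illustration of the pipeline described in the abstract (quantize to tokens, linearize with a Z-order curve, code top-k ranks with fallback corrections), a minimal Python sketch of these three steps might look as follows. This is not the authors' implementation: `predict_probs`, the 10-bit coordinate width, and `k` are hypothetical placeholders.

```python
import numpy as np

# --- Step 1: error-bounded quantization (hypothetical parameters) ---
def quantize(field, eps):
    """Uniform scalar quantization with step 2*eps: dequantizing the
    tokens keeps every point within the absolute error bound eps."""
    step = 2.0 * eps
    return np.round(field / step).astype(np.int64), step

def dequantize(tokens, step):
    return tokens * step

# --- Step 2: Z-order (Morton) linearization to preserve 3D locality ---
def _spread(v):
    """Spread the low 10 bits of v so they land on every third bit."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3d(x, y, z):
    """Interleave coordinate bits so spatially close cells become
    neighbors in the 1D token stream fed to the transformer."""
    return _spread(x) | (_spread(y) << 1) | (_spread(z) << 2)

# --- Step 3: top-k rank coding with fallback corrections ---
def encode_stream(tokens, predict_probs, k=16):
    """predict_probs(prefix) stands in for the trained autoregressive
    model; it returns a probability vector over the token vocabulary."""
    symbols = []
    for i, t in enumerate(tokens):
        topk = np.argsort(predict_probs(tokens[:i]))[::-1][:k]
        hit = np.nonzero(topk == t)[0]
        if hit.size:                         # store only the small rank
            symbols.append(("rank", int(hit[0])))
        else:                                # fallback: raw token, so the
            symbols.append(("raw", int(t)))  # error bound still holds
    return symbols
```

When the model predicts well, most symbols are ranks near zero, so the stream is highly skewed and compresses well under a standard entropy coder; the raw-token fallback guarantees the quantization error bound is never exceeded.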
Related papers
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z) - QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the full-precision (FP) and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction [9.302754209202607]
Large language models (LLMs) continue to be deployed and utilized across domains. Compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. We show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip.
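A toy sketch of the rank-coding idea behind such LLM-based lossless compressors follows; a bigram model stands in for the LLM here (with a real model compressing its own generations, the ranks are almost always zero, which is where ratios above 20x come from). All names are illustrative, and ASCII input is assumed.

```python
import zlib
from collections import Counter, defaultdict

class BigramLM:
    """Toy stand-in for an LLM's next-token distribution (ASCII only)."""
    def __init__(self, corpus):
        self.freq = defaultdict(Counter)
        for a, b in zip(corpus, corpus[1:]):
            self.freq[a][b] += 1

    def ranked(self, prev):
        # candidate next symbols, most probable first
        return [s for s, _ in self.freq[prev].most_common()]

def encode(lm, text):
    out = bytearray([ord(text[0])])        # first symbol sent verbatim
    for prev, cur in zip(text, text[1:]):
        cand = lm.ranked(prev)
        r = cand.index(cur) if cur in cand else 255
        if r < 255:
            out.append(r)                  # small, highly skewed rank
        else:
            out += bytes([255, ord(cur)])  # escape: rare symbol verbatim
    return zlib.compress(bytes(out))       # entropy-code the rank stream

def decode(lm, blob):
    data = zlib.decompress(blob)
    text, i = [chr(data[0])], 1
    while i < len(data):
        if data[i] == 255:                 # escaped symbol
            text.append(chr(data[i + 1])); i += 2
        else:
            text.append(lm.ranked(text[-1])[data[i]]); i += 1
    return "".join(text)

# usage: both sides share the same model, as with an LLM compressor
# lm = BigramLM(reference_text); assert decode(lm, encode(lm, msg)) == msg
```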
arXiv Detail & Related papers (2025-05-07T17:42:35Z) - Guaranteed Conditional Diffusion: 3D Block-based Models for Scientific Data Compression [10.848192105624848]
This paper proposes a new compression paradigm: Guaranteed Conditional Diffusion with Tensor Correction (GCDTC). It consists of a conditional diffusion model, tensor correction, and an error guarantee. Our framework outperforms standard convolutional autoencoders and yields compression quality competitive with an existing scientific data compression algorithm.
arXiv Detail & Related papers (2025-02-18T15:33:09Z) - Foundation Model for Lossy Compression of Spatiotemporal Scientific Data [11.494915987840876]
We present a foundation model (FM) for lossy scientific data compression. We combine a variational autoencoder (VAE) with a hyper-prior structure and a super-resolution (SR) module.
arXiv Detail & Related papers (2024-12-22T22:57:08Z) - Variable Rate Neural Compression for Sparse Detector Data [9.331686712558144]
We propose a novel approach for TPC data compression via key-point identification facilitated by sparse convolution.
BCAE-VS achieves a 75% improvement in reconstruction accuracy with a 10% increase in compression ratio over the previous state-of-the-art model.
arXiv Detail & Related papers (2024-11-18T17:15:35Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - NeurLZ: An Online Neural Learning-Based Method to Enhance Scientific Lossy Compression [34.30562110131907]
NeurLZ is a neural method designed to enhance lossy compression by integrating online learning, cross-field learning, and robust error regulation. During the first five learning epochs, NeurLZ achieves an 89% bit rate reduction, with further optimization yielding up to around 94% reduction at equivalent distortion.
arXiv Detail & Related papers (2024-09-09T16:48:09Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose an efficient and universal data selection method.
We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests that our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
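For context, here is a minimal PyTorch sketch of the generic soft-prompt technique (trainable embeddings prepended to the input of a frozen, compressed LM). The module and usage names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt embeddings prepended to the input sequence; the
    compressed LM stays frozen, so only n_tokens * d_model parameters
    are learned."""
    def __init__(self, n_tokens, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):          # (batch, seq, d_model)
        batch = input_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

# usage sketch with a frozen (e.g. quantized or pruned) causal LM:
# for p in model.parameters(): p.requires_grad_(False)
# soft = SoftPrompt(n_tokens=20, d_model=model.config.hidden_size)
# embeds = soft(model.get_input_embeddings()(input_ids))
# loss = model(inputs_embeds=embeds, labels=labels).loss
# (labels padded with -100 over the prompt positions; only `soft` trains)
```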
arXiv Detail & Related papers (2023-05-17T20:45:13Z)