Related papers: PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

URL: http://arxiv.org/abs/2511.17637v1
Date: Wed, 19 Nov 2025 08:46:26 GMT
Title: PocketLLM: Ultimate Compression of Large Language Models via Meta Networks
Authors: Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, Kai Han,
Abstract summary: We introduce PocketLLM, a novel approach to compress Large Language Models. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space.
Score: 43.829543128192455
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. This method allows for significant compression of the large weights in LLMs, consisting solely of a small decoder, a concise codebook, and an index. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
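
The abstract describes a compressed model as a small decoder, a codebook, and an index. A minimal Python sketch of that storage layout follows. It is an illustration under stated assumptions, not the authors' implementation: `toy_encoder` and `toy_decoder` are hypothetical fixed functions standing in for the learned meta-networks, the codebook is hand-picked rather than trained, and every number passed to `compression_ratio` is made up for the example.

```python
import math

def toy_encoder(group):
    # Hypothetical stand-in for the learned encoder: scale weights into latent space.
    return [x * 0.5 for x in group]

def toy_decoder(code):
    # Hypothetical stand-in for the learned decoder: map a latent code back to weights.
    return [z * 2.0 for z in code]

def nearest_code(latent, codebook):
    # Index of the codebook vector closest to `latent` (squared Euclidean distance).
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(latent, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def compress(weight_groups, codebook):
    # Only the integer indices are stored per weight group.
    return [nearest_code(toy_encoder(g), codebook) for g in weight_groups]

def decompress(indices, codebook):
    # Look up each code and decode it back to the weight space.
    return [toy_decoder(codebook[i]) for i in indices]

def compression_ratio(num_weights, group_size, codebook_size, code_dim,
                      decoder_params, bits_per_value=16):
    # Stored size = index + codebook + decoder, compared with the raw weights.
    original_bits = num_weights * bits_per_value
    num_groups = math.ceil(num_weights / group_size)
    index_bits = num_groups * math.ceil(math.log2(codebook_size))
    codebook_bits = codebook_size * code_dim * bits_per_value
    decoder_bits = decoder_params * bits_per_value
    return original_bits / (index_bits + codebook_bits + decoder_bits)

# Toy data: three groups of two weights, and a two-entry latent codebook.
weight_groups = [[0.9, 1.1], [-1.0, -0.8], [1.0, 1.0]]
codebook = [[0.5, 0.5], [-0.5, -0.5]]

indices = compress(weight_groups, codebook)
restored = decompress(indices, codebook)
ratio = compression_ratio(num_weights=1_000_000, group_size=8,
                          codebook_size=256, code_dim=8, decoder_params=1000)

print(indices)  # [0, 1, 0]
print(restored)
print(round(ratio, 2))
```

In this accounting the index dominates the stored size, so the ratio is driven mostly by the group size and the number of bits per index; the codebook and decoder add a fixed overhead that shrinks in relative terms as the weight matrix grows.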

Related papers

Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction [9.302754209202607]
Large language models (LLMs) continue to be deployed and utilized across domains. Compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. We show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip.
arXiv Detail & Related papers (2025-05-07T17:42:35Z)
Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference [19.59857352852377]
Large language models (LLMs) have continued to rapidly increase in size. This has exacerbated the difficulty in running state-of-the-art LLMs on small, edge devices. We propose Huff-LLM, a method that lets users store LLM weights in compressed format.
arXiv Detail & Related papers (2025-02-02T21:23:42Z)
Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
A large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities. We propose P$^2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies. Experiments on benchmark datasets demonstrate that P$^2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z)
Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require a significant amount of memory storage during inference. In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z)
Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration. Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models [19.502740996431452]
Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression. We propose a memory-efficient DKM implementation, eDKM, powered by novel techniques to reduce the memory footprint of DKM by orders of magnitude.
arXiv Detail & Related papers (2023-09-02T15:16:35Z)
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique. SpQR achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.