Better Prompt Compression Without Multi-Layer Perceptrons
- URL: http://arxiv.org/abs/2501.06730v1
- Date: Sun, 12 Jan 2025 06:57:06 GMT
- Title: Better Prompt Compression Without Multi-Layer Perceptrons
- Authors: Edouardo Honig, Andrew Lizarraga, Zijun Frank Zhang, Ying Nian Wu
- Abstract summary: We show that the encoder does not need to keep the original language model's architecture to achieve useful compression.
We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers from the Transformer blocks of a language model.
- Score: 33.53334153279698
- License:
- Abstract: Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters than the original model. Intriguingly, we find that, across a range of compression ratios up to 480x, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language model with its MLP layers intact. These results demonstrate that the architecture of prompt compression encoders does not need to be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
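To make the architectural change concrete, below is a minimal, self-contained PyTorch sketch of an attention-only compression encoder: pre-norm Transformer blocks with the MLP sub-layer deleted, plus a handful of learned compression tokens whose final hidden states act as the compressed prompt. This is not the authors' implementation; the module names, dimensions, and random initialization are illustrative assumptions (AOC itself starts from a pretrained decoder and removes its MLP layers).

```python
# Minimal sketch (not the authors' code) of an attention-only compression
# encoder: pre-norm Transformer blocks with the MLP sub-layer removed, plus
# learned compression tokens whose final hidden states form the compressed
# prompt. Sizes, names, and random init are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionOnlyBlock(nn.Module):
    """A Transformer block with no feed-forward (MLP) sub-layer."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # No MLP here: in Llama-style blocks the MLP holds most of the
        # per-block parameters, which is where the ~67% reduction comes from.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out  # residual connection around attention only


class AttentionOnlyCompressor(nn.Module):
    """Encodes a prompt together with K learned compression tokens and
    returns the K final hidden states as the compressed prompt."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512,
                 n_layers: int = 6, n_compress_tokens: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.compress_tokens = nn.Parameter(torch.randn(n_compress_tokens, d_model))
        self.blocks = nn.ModuleList(
            [AttentionOnlyBlock(d_model) for _ in range(n_layers)])

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch = input_ids.shape[0]
        x = self.embed(input_ids)                                      # (B, T, D)
        mem = self.compress_tokens.unsqueeze(0).expand(batch, -1, -1)  # (B, K, D)
        x = torch.cat([x, mem], dim=1)
        for block in self.blocks:
            x = block(x)
        return x[:, -mem.shape[1]:, :]  # compressed prompt, shape (B, K, D)


if __name__ == "__main__":
    encoder = AttentionOnlyCompressor()
    prompt_ids = torch.randint(0, 32000, (2, 128))  # two 128-token prompts
    print(encoder(prompt_ids).shape)                # torch.Size([2, 4, 512])
```

In the paper's setup the compressed states are consumed by the frozen inference model; training that full pipeline, and the LoRA baseline it is compared against, is beyond this sketch.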
Related papers
- L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression [23.179381396167084]
We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC).
RWKV models achieve the fastest decoding speed with a moderate compression ratio.
We propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens.
arXiv Detail & Related papers (2024-12-21T14:24:32Z)
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models [72.24305287508474]
We introduce DiCoDe, a novel approach to generate videos with a language model in an autoregressive manner.
By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation.
We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality.
arXiv Detail & Related papers (2024-12-05T18:57:06Z)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one (a minimal sketch of this formulation appears after this list).
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
arXiv Detail & Related papers (2024-03-19T17:59:56Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression [39.233017243612025]
Large language models (LLMs) require lengthy prompts as the input context to produce output aligned with user intentions.
We propose a novel method for compressing prompts which also can assist the prompt interpretation and engineering.
Gist-COCO employs an encoder-decoder based language model and then incorporates an additional encoder as a plugin module to compress prompts with inputs using gist tokens.
arXiv Detail & Related papers (2024-02-25T11:07:08Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Video Coding Using Learned Latent GAN Compression [1.6058099298620423]
We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video.
Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned.
arXiv Detail & Related papers (2022-07-09T19:07:43Z)
- Conditional Entropy Coding for Efficient Video Compression [82.35389813794372]
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames.
We first show that a simple architecture modeling the entropy between the image latent codes is as competitive as other neural video compression works and video codecs.
We then propose a novel internal learning extension on top of this architecture that brings an additional 10% savings without trading off decoding speed.
arXiv Detail & Related papers (2020-08-20T20:01:59Z)
- A flexible, extensible software framework for model compression based on the LC algorithm [10.787390511207683]
We propose a software framework that allows a user to compress a neural network or other machine learning model with minimal effort.
The library is written in Python and PyTorch and is available on GitHub.
arXiv Detail & Related papers (2020-05-15T21:14:48Z)
- Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)
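As referenced in the LLMLingua-2 entry above, the following is a minimal sketch of prompt compression framed as binary token classification: an encoder predicts keep/drop for each token and the kept tokens are concatenated into a shorter prompt. The checkpoint name is taken from that summary, but the classification head here is untrained and the decision rule is illustrative; the paper trains the classifier on distilled keep/drop labels.

```python
# Minimal sketch of prompt compression as token classification (LLMLingua-2
# style). The classification head below is randomly initialized and only
# illustrates the inference path; real use requires a head trained on
# distilled keep/drop labels.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2)  # label 1 = keep, label 0 = drop

prompt = "Please summarize the following meeting notes in three bullet points: ..."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, 2)
keep_mask = logits.argmax(dim=-1)[0].bool()  # per-token keep/drop decision

kept_ids = inputs["input_ids"][0][keep_mask]
compressed_prompt = tokenizer.decode(kept_ids, skip_special_tokens=True)
print(compressed_prompt)  # shorter prompt passed to the downstream LLM
```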