Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs
- URL: http://arxiv.org/abs/2511.23271v1
- Date: Fri, 28 Nov 2025 15:22:52 GMT
- Title: Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs
- Authors: Jiancheng Dong, Pengyue Jia, Jingyu Peng, Maolin Wang, Yuhao Wang, Lixin Su, Xin Sun, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao,
- Abstract summary: We propose a lightweight training framework that learns a single prompt-specific Behavior-Equivalent token ([BE])<n>The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token.<n> Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts.
- Score: 55.827877498548965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt 's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.
Related papers
- ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models [7.7352936204066]
We propose a novel, zero-shot method to model visual token pruning as a balance between task relevance and information diversity.<n>Our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss.<n>These gains are accompanied by significant reductions in GPU memory footprint and inference latency.
arXiv Detail & Related papers (2025-10-20T06:18:47Z) - All You Need is One: Capsule Prompt Tuning with a Single Vector [86.68105855537762]
Current prompt-based learning methods rely on laborious grid searching for optimal prompt length and typically require considerable number of prompts.<n>We introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning.<n>Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner.
arXiv Detail & Related papers (2025-10-19T00:02:59Z) - FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency.<n>We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens.<n>We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z) - SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning [3.502168555273189]
SlimInfer aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass.<n>We show that SlimInfer can achieve up to $mathbf2.53times$ time-to-first-token (TTFT) speedup and $mathbf1.88times$ end-to-end latency reduction for LLaMA3.1-8B-Instruct.
arXiv Detail & Related papers (2025-08-08T16:42:38Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models.<n>CoT incurs substantial inference cost due to long autoregressive trajectories.<n>We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [35.38573641029626]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens.<n>On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning [76.32953653161417]
Class-incremental learning enables models to learn new classes progressively while preserving knowledge of previously learned ones.<n>Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques.<n>We present a novel prompt-based approach that addresses the limitation of current approaches.
arXiv Detail & Related papers (2025-03-11T02:27:37Z) - Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product [8.014705094248589]
Low- parameters prompt tuning method outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.<n>Experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
arXiv Detail & Related papers (2025-02-16T05:50:12Z) - InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs)<n>We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm.<n>Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z) - BatchPrompt: Accomplish more with less [9.204837699571788]
BatchPrompt is an efficient way to batch data within the token limit.
To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling.
This is the first work to technically improve prompting efficiency of large language models.
arXiv Detail & Related papers (2023-09-01T10:44:36Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.