KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
- URL: http://arxiv.org/abs/2601.01046v1
- Date: Sat, 03 Jan 2026 02:55:43 GMT
- Title: KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
- Authors: Yixuan Tang, Yi Yang
- Abstract summary: We propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. We show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens.
- Score: 12.949322198287417
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.
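The core re-routing idea can be illustrated with a toy single-head attention layer in numpy: run one causal pass, cache the final token's key/value state, then run a second pass with that state prepended as a one-token prefix so every token can attend to it. This is a minimal sketch under toy assumptions (one layer, random weights, no layer selection); in the actual method the final-token KV states come from a frozen multi-layer LLM, where deeper layers aggregate sequence-level context.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model/head dimension (toy size)
T = 5          # sequence length

# Toy projection weights standing in for one frozen attention layer.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attention(x, prefix_kv=None):
    """Single-head causal attention; optionally prepend extra (K, V) rows as a prefix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    if prefix_kv is not None:
        pk, pv = prefix_kv
        k, v = np.vstack([pk, k]), np.vstack([pv, v])
    scores = q @ k.T / np.sqrt(d)
    # Prefix rows (if any) are visible to every query; the rest stays causal.
    n_prefix = k.shape[0] - x.shape[0]
    mask = np.triu(np.ones((x.shape[0], x.shape[0]), dtype=bool), 1)
    scores[:, n_prefix:][mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, (k, v)

x = rng.standard_normal((T, d))

# Pass 1: plain causal attention; cache the final token's K/V state.
out1, (k1, v1) = attention(x)
final_kv = (k1[-1:], v1[-1:])

# Pass 2: re-route that state as a one-token prefix, giving even the
# first token access to a compressed view of the whole sequence.
out2, _ = attention(x, prefix_kv=final_kv)

embedding = out2.mean(axis=0)  # mean-pool token outputs into one embedding
print(embedding.shape)
```

Note that both passes reuse the same frozen weights; the only intervention is where the cached KV rows are routed, which is what makes the approach training-free.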
Related papers
- Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks [4.888851550406879]
This paper proposes a weakly supervised framework to tackle the automatic recognition of "concealed emotions" in videos. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts performance from under 0.6 in prior work to over 0.69.
arXiv Detail & Related papers (2026-02-08T17:02:55Z) - PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective [59.24570811503256]
We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) to accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance.
arXiv Detail & Related papers (2026-02-04T15:33:10Z) - LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States [13.418437639290532]
Sentence representations are foundational to many Natural Language Processing (NLP) applications. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states.
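The contrast the abstract draws can be made concrete with a toy example: pool the attention value vectors (hidden states passed through the value projection) instead of the hidden states themselves. This is a minimal sketch with random weights; `Wv` here is a stand-in for a real layer's value-projection matrix, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 6

hidden = rng.standard_normal((T, d))            # token hidden states at some layer
Wv = rng.standard_normal((d, d)) / np.sqrt(d)   # toy value-projection matrix

# Attention value vectors for the same tokens.
values = hidden @ Wv

# Two candidate sentence embeddings: pooled hidden states vs pooled value vectors.
emb_hidden = hidden.mean(axis=0)
emb_values = values.mean(axis=0)
print(emb_hidden.shape, emb_values.shape)
```

Because mean-pooling is linear, pooling the values is equivalent to projecting the pooled hidden state through `Wv`; the papers' claim is that this projected space organizes sentence semantics better.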
arXiv Detail & Related papers (2026-02-02T03:09:37Z) - IPCV: Information-Preserving Compression for MLLM Visual Encoders [44.76073540999133]
IPCV is a training-free, information-preserving compression framework for MLLM visual encoders. We introduce Attention Stabilization (AS) to further alleviate the negative influence of token pruning. IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods.
arXiv Detail & Related papers (2025-12-21T14:28:28Z) - One Last Attention for Your Vision-Language Model [42.872184600248914]
We propose Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix. Experiments show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings.
arXiv Detail & Related papers (2025-07-21T10:35:32Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [35.38573641029626]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
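The discrete-token step underlying this style of reduction is classic vector quantization: each continuous token is snapped to its nearest codebook entry, so the whole sequence is summarized by a tiny codebook plus per-token indices. The sketch below is a generic VQ assignment in numpy under toy assumptions (random tokens and codebook), not VQToken's learned neural codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_codes = 16, 512, 4   # extreme reduction: 512 tokens -> 4 codes

tokens = rng.standard_normal((n_tokens, d))     # stand-in for video token features
codebook = rng.standard_normal((n_codes, d))    # stand-in for a learned codebook

# Nearest-codebook-entry assignment: the core vector-quantization step.
dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)

# The sequence is now representable by n_codes vectors plus per-token indices.
quantized = codebook[codes]
print(quantized.shape)
```

In a trained system the codebook is optimized so that this quantization loses as little task-relevant information as possible.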
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, activation-aware approach that dynamically determines a probe-query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the LongBench and $\infty$Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
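The retrieval step can be sketched in numpy: score each cached key against a probe query and keep only the top-k KV pairs. This is a minimal illustration with random data; the probe construction here (a mean of recent query states) is a hypothetical stand-in for ActQKV's activation-aware probe-query, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, k = 16, 100, 8

# Toy KV cache for a long context.
keys = rng.standard_normal((n_ctx, d))
values = rng.standard_normal((n_ctx, d))

# Hypothetical probe query: mean of the last few query states.
recent_queries = rng.standard_normal((4, d))
probe_q = recent_queries.mean(axis=0)

# Retrieve the top-k KV pairs by dot-product relevance to the probe.
scores = keys @ probe_q
topk = np.argsort(scores)[-k:][::-1]    # indices, highest score first
kept_keys, kept_values = keys[topk], values[topk]
print(kept_keys.shape)
```

Only the retained pairs are then fed to attention, shrinking the effective cache from `n_ctx` to `k` entries.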
arXiv Detail & Related papers (2025-02-19T08:50:44Z) - Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment [84.74716380180428]
We propose AutoRegEmbed, a contrastive learning method built on embedding conditional probability distributions. We show that our method significantly outperforms traditional contrastive learning approaches.
arXiv Detail & Related papers (2025-02-17T03:36:25Z) - In-context KV-Cache Eviction for LLMs via Attention-Gate [12.732519329131392]
The KV-Cache technique has become the standard for the inference of large language models (LLMs). This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
arXiv Detail & Related papers (2024-10-15T05:01:19Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
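The low-rank idea can be sketched with a truncated SVD of a KV projection weight: factor `W_K ≈ A @ B` so the cache stores rank-r intermediates instead of full key vectors. This is a toy numpy illustration of generic low-rank weight factorization (random `Wk`, arbitrary rank), not LoRC's progressive per-layer compression strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 64, 8

# Toy key-projection weight standing in for a pretrained W_K.
Wk = rng.standard_normal((d_model, d_head))

# Truncated SVD gives the best rank-r factorization Wk ~= A @ B.
U, S, Vt = np.linalg.svd(Wk, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (d_model, rank)
B = Vt[:rank]                # (rank, d_head)

x = rng.standard_normal((10, d_model))
cache = x @ A                # store (10, rank) instead of full (10, d_head) keys
k_approx = cache @ B         # reconstruct approximate keys on the fly
k_full = x @ Wk
err = np.linalg.norm(k_full - k_approx) / np.linalg.norm(k_full)
print(k_approx.shape)
```

Because the factorization acts on the weights rather than the activations, it plugs into an existing transformer without retraining, at the cost of the reconstruction error `err` controlled by the chosen rank.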
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z) - Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations [51.75960511842552]
Fine-tuning of pretrained language models (PLMs) is prone to overfitting in low-resource scenarios.
We present a novel method that operates on the hidden representations of a PLM to reduce overfitting.
arXiv Detail & Related papers (2022-11-16T09:39:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.