Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
- URL: http://arxiv.org/abs/2511.14868v1
- Date: Tue, 18 Nov 2025 19:37:40 GMT
- Title: Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
- Authors: Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao
- Abstract summary: We propose Hierarchical Token Prepending (HTP) to mitigate attention-level compression and readout-level over-squashing. HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating pathways for backward information flow. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
- Score: 52.49524240846879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
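The two ideas the abstract describes can be sketched in a few lines: partition the token sequence into blocks, give each block after the first a backward path by prepending summary tokens for the preceding blocks, and read out the final embedding with mean-pooling rather than last-token pooling. The sketch below is a minimal illustration under assumed simplifications, not the paper's implementation: the summary tokens are stand-in sentinel ids (a real system would derive them from the model's hidden states), and the block size and token representation are hypothetical.

```python
import numpy as np


def htp_prepend_blocks(tokens, block_size):
    """Partition `tokens` into fixed-size blocks and, for each block after
    the first, prepend one summary token per preceding block. This gives
    later blocks multiple backward pathways to earlier content under a
    causal attention mask. Summary tokens here are placeholder sentinels.
    """
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    augmented = []
    for b, block in enumerate(blocks):
        summaries = [f"<sum_{j}>" for j in range(b)]  # hypothetical sentinel ids
        augmented.append(summaries + block)
    return augmented


def mean_pool(hidden_states):
    """Readout: average all token embeddings instead of taking only the
    last token's, which the paper argues avoids over-squashing."""
    return hidden_states.mean(axis=0)


# Example: six tokens, block size 2.
augmented = htp_prepend_blocks(["a", "b", "c", "d", "e", "f"], block_size=2)
# Block 0 is unchanged; block 2 sees summaries of blocks 0 and 1.
pooled = mean_pool(np.array([[1.0, 3.0], [3.0, 5.0]]))
```

Note that mean-pooling is a drop-in change to the readout only; the block partitioning happens at input time, which is why the method is architecture-agnostic.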
Related papers
- Stacked from One: Multi-Scale Self-Injection for Context Window Extension [69.24689919827817]
The proposed framework is based on multi-grained context compression and query-aware information acquisition. It achieves performance superior or comparable to strong baselines.
arXiv Detail & Related papers (2026-03-05T03:16:16Z)
- Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
- Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation [49.48204107529758]
We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
arXiv Detail & Related papers (2026-02-12T18:15:08Z)
- CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows [0.9275065651255189]
Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression.
arXiv Detail & Related papers (2025-10-20T19:31:11Z)
- ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [50.93758649363798]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z)
- Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs [23.960451986662996]
This paper proposes a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought reasoning. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text.
arXiv Detail & Related papers (2025-02-18T02:49:40Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- ChuLo: Chunk-Level Key Information Representation for Long Document Understanding [11.29459225491404]
ChuLo is a novel chunk representation method for long document understanding. Our approach minimizes information loss and improves the efficiency of Transformer-based models.
arXiv Detail & Related papers (2024-10-14T22:06:54Z)
- REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [11.374031643273941]
REXEL is a highly efficient and accurate model for the joint task of document-level closed information extraction (DocIE).
It is on average 11 times faster than competitive existing approaches in a similar setting.
The combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale.
arXiv Detail & Related papers (2024-04-19T11:04:27Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- SDR: Efficient Neural Re-ranking using Succinct Document Representation [4.9278175139681215]
We propose the Succinct Document Representation scheme, which computes highly compressed intermediate document representations.
Our method is highly efficient, achieving 4x-11.6x better compression rates for the same ranking quality.
arXiv Detail & Related papers (2021-10-03T07:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.