LMK > CLS: Landmark Pooling for Dense Embeddings
- URL: http://arxiv.org/abs/2601.21525v1
- Date: Thu, 29 Jan 2026 10:40:37 GMT
- Title: LMK > CLS: Landmark Pooling for Dense Embeddings
- Authors: Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi
- Abstract summary: We introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens.
- Score: 18.49372789918725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
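The pooling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `[LMK]` token string, the fixed `chunk_size`, the choice to place a landmark after every chunk (including the last one), and the random array standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np

LMK = "[LMK]"  # hypothetical landmark token string, not taken from the paper


def insert_landmarks(tokens, chunk_size):
    """Split tokens into fixed-size chunks and append a landmark after each chunk.

    Returns the augmented token list and the indices of the landmark tokens.
    Placing a landmark after the final chunk as well is an assumption.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    augmented, landmark_positions = [], []
    for start in range(0, len(tokens), chunk_size):
        augmented.extend(tokens[start:start + chunk_size])
        landmark_positions.append(len(augmented))  # index the landmark will occupy
        augmented.append(LMK)
    return augmented, landmark_positions


def lmk_pool(hidden_states, landmark_positions):
    """Mean-pool the hidden states at the landmark positions only."""
    return hidden_states[landmark_positions].mean(axis=0)


def mean_pool(hidden_states):
    """Baseline for comparison: mean over every token position."""
    return hidden_states.mean(axis=0)


tokens = "representation learning is central to many downstream tasks".split()
augmented, positions = insert_landmarks(tokens, chunk_size=3)

# Stand-in for encoder output: one 4-dimensional vector per token (assumption).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(len(augmented), 4))

print(augmented)
print(positions)
print("LMK pooling: ", lmk_pool(hidden, positions))
print("mean pooling:", mean_pool(hidden))
```

In a real encoder the landmark tokens would be added to the vocabulary and the model trained with them in place, so that each landmark learns to summarize its neighborhood; the sketch only shows where the landmarks sit and which positions feed the final vector.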
Related papers
- CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill [4.440373965918973]
We introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing oracles exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
arXiv Detail & Related papers (2026-02-17T22:08:16Z)
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- NILC: Discovering New Intents with LLM-assisted Clustering [15.077590298929719]
New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances. This paper proposes NILC, a novel clustering framework specially catered for effective NID.
arXiv Detail & Related papers (2025-11-08T08:18:44Z)
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs [17.499497967319332]
We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.
arXiv Detail & Related papers (2025-10-28T16:34:18Z)
- Exploiting Discriminative Codebook Prior for Autoregressive Image Generation [54.14166700058777]
Token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. We propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means
arXiv Detail & Related papers (2025-08-14T15:00:00Z)
- Summaries as Centroids for Interpretable and Scalable Text Clustering [0.0]
We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
- Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
- SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting [67.97870844244187]
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. We propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size.
arXiv Detail & Related papers (2023-11-16T16:50:56Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch as an auxiliary semantic prompt to replace the [CLS] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves SOTA performance, and it is even comparable with few-shot semantic segmentation methods.
arXiv Detail & Related papers (2023-04-13T01:35:07Z)
- Revisiting Deep Local Descriptor for Improved Few-Shot Classification [56.74552164206737]
We show how one can improve the quality of embeddings by leveraging Dense Classification and Attentive Pooling.
We suggest pooling feature maps with attentive pooling instead of the widely used global average pooling (GAP) to prepare embeddings for few-shot classification.
arXiv Detail & Related papers (2021-03-30T00:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.