Towards Lossless Token Pruning in Late-Interaction Retrieval Models
- URL: http://arxiv.org/abs/2504.12778v1
- Date: Thu, 17 Apr 2025 09:18:58 GMT
- Title: Towards Lossless Token Pruning in Late-Interaction Retrieval Models
- Authors: Yuxuan Zong, Benjamin Piwowarski
- Abstract summary: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a large amount of memory to store the contextual representations of all document tokens. We propose a principled approach to define how to prune tokens without impacting the score between a document and a query.
- Score: 10.983837305643723
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a large amount of memory to store the contextual representations of all document tokens. Some works have proposed using either heuristics or statistical techniques to prune tokens from each document. This, however, does not guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce solutions with high pruning ratios, as well as two pruning strategies. We study them experimentally (in- and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
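To make the late-interaction setting concrete, the sketch below shows ColBERT-style MaxSim scoring together with an illustrative, score-preserving pruning rule: a document token that is never the best match for any query token in a reference pool can be dropped without changing the scores for those queries. This is a minimal sketch of the general idea, not the paper's method; the function names, the NumPy formulation, and the fixed query-token pool are assumptions made for illustration.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    For each query token, take the maximum similarity over all document
    tokens, then sum over query tokens.
    query_emb: (n_q, dim), doc_emb: (n_d, dim), both L2-normalized.
    """
    sim = query_emb @ doc_emb.T          # (n_q, n_d) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim aggregation

def prune_never_best_tokens(doc_emb: np.ndarray, query_pool: np.ndarray) -> np.ndarray:
    """Illustrative score-preserving pruning (assumption, not the paper's rule).

    A document token that never attains the maximum similarity for any
    query token in `query_pool` cannot contribute to the MaxSim score of
    those queries, so removing it leaves their scores unchanged.
    """
    sim = query_pool @ doc_emb.T              # (n_pool, n_d)
    best = sim.max(axis=1, keepdims=True)     # best similarity per query token
    is_argmax = np.isclose(sim, best)         # document tokens attaining the max
    keep = is_argmax.any(axis=0)              # keep tokens that win for some query token
    return doc_emb[keep]

# Usage sketch with random, normalized embeddings (hypothetical data).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
pruned = prune_never_best_tokens(d, q)
assert np.isclose(maxsim_score(q, d), maxsim_score(q, pruned))
```

The pool of reference query tokens is the weak point of such a rule in practice, which is why the paper instead trains with regularization losses that make high pruning ratios achievable without relying on a fixed query set.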
Related papers
- Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More [18.928285521147057]
We show that importance is not an ideal indicator to decide whether a token should be pruned.
We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens (see the sketch after this list).
Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance.
arXiv Detail & Related papers (2025-02-17T06:56:28Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - Tree Cross Attention [59.8891512435847]
Tree Cross Attention (TCA) is a module based on Cross Attention that retrieves information from only a logarithmic number of tokens, $\mathcal{O}(\log N)$, when performing inference.
We show that TCA performs comparably to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient.
arXiv Detail & Related papers (2023-09-29T16:50:23Z) - Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery [64.37035857740781]
We present EANet, extract-and-adaptation network, with EABlock, the main component of our network.
Our two novel tokens are derived from a combination of the two separated hand features; hence, they are much more robust to the distant token problem.
The proposed EANet achieves the state-of-the-art performance on 3D interacting hands benchmarks.
arXiv Detail & Related papers (2023-09-05T04:18:03Z) - Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation [89.88214896713846]
STA score considers two critical factors: temporal redundancy and semantic importance.
We apply the STA module to off-the-shelf video Transformers and Video Swin models.
Results on Kinetics-400 and Something-Something V2 show roughly a 30% reduction in computation with a negligible 0.2% accuracy drop.
arXiv Detail & Related papers (2023-08-08T19:38:15Z) - Revisiting Token Pruning for Object Detection and Instance Segmentation [25.3324628669201]
We investigate token pruning to accelerate inference for object detection and instance segmentation.
Compared to existing token pruning methods, we reduce the performance drop from 1.5 mAP to 0.3 mAP for both boxes and masks.
arXiv Detail & Related papers (2023-06-12T11:55:33Z) - Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z) - CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval [72.90850213615427]
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
These methods are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
arXiv Detail & Related papers (2022-11-18T18:27:35Z) - Breaking BERT: Evaluating and Optimizing Sparsified Attention [13.529939025511242]
We evaluate the impact of sparsification patterns with a series of ablation experiments.
We find that even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers.
arXiv Detail & Related papers (2022-10-07T22:32:27Z) - AdapLeR: Speeding up Inference by Adaptive Length Reduction [15.57872065467772]
We propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance.
Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost.
Our experiments on several diverse classification tasks show speedups of up to 22x at inference time without much sacrifice in performance.
arXiv Detail & Related papers (2022-03-16T23:41:38Z)
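As a rough illustration of the duplication-based pruning idea referenced above (the general notion behind DART, not its actual algorithm), the sketch below greedily keeps a token only if its embedding is not a near-duplicate of an already-kept token; the cosine-similarity threshold, the greedy order, and the function name are assumptions for illustration.

```python
import numpy as np

def prune_duplicate_tokens(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedy duplication-based pruning sketch.

    Keep a token only if its cosine similarity to every already-kept token
    stays below `threshold`; near-duplicate tokens are dropped.
    tokens: (n, dim) token embeddings.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = []  # indices of retained tokens
    for i, vec in enumerate(normed):
        if not kept or (normed[kept] @ vec).max() < threshold:
            kept.append(i)  # sufficiently novel token: keep it
    return tokens[kept]
```

In such a scheme the threshold controls the trade-off between the pruning ratio and how much of the original information is retained.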