CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference
- URL: http://arxiv.org/abs/2511.21702v1
- Date: Sun, 16 Nov 2025 14:02:41 GMT
- Title: CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference
- Authors: Dong Liu, Yanxuan Yu, Ben Lengerich
- Abstract summary: CSV-Decode is a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation.
- Score: 4.832840038837715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation is available at \href{https://github.com/FastLM/CSV-Decode}{https://github.com/FastLM/CSV-Decode}.
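To make the offline-clustering and bounding idea concrete, below is a minimal Python sketch reconstructed from the abstract alone: the function names, the synthetic clustered vocabulary, and the provisional-threshold certification rule are our assumptions, not the authors' released implementation (see the linked repository for that). The key inequality is Cauchy-Schwarz: for any token embedding $w_i$ in cluster $j$ with centroid $c_j$ and radius $r_j$, $\langle w_i, h \rangle \le \langle c_j, h \rangle + r_j \lVert h \rVert$, so any cluster whose bound falls below a known top-$k$ logit can be skipped without changing the exact top-$k$.

```python
import numpy as np

def cluster_stats(W, labels, n_clusters):
    """Offline step: given any clustering of the output-embedding rows W,
    record each cluster's centroid and radius (max distance to a member)."""
    centroids = np.stack([W[labels == j].mean(axis=0) for j in range(n_clusters)])
    radii = np.array([np.linalg.norm(W[labels == j] - centroids[j], axis=1).max()
                      for j in range(n_clusters)])
    return centroids, radii

def certified_topk_logits(h, W, labels, centroids, radii, k=5):
    """Per decode step: by Cauchy-Schwarz, <w_i, h> <= <c_j, h> + r_j*||h||
    for every token i in cluster j, so clusters bounded below a provisional
    top-k threshold can be skipped without changing the exact top-k."""
    upper = centroids @ h + radii * np.linalg.norm(h)   # one bound per cluster
    seed = int(np.argmax(upper))                        # best-bounded cluster
    seed_logits = W[labels == seed] @ h                 # exact logits there
    thresh = np.sort(seed_logits)[-min(k, seed_logits.size)]
    keep = np.isin(labels, np.where(upper >= thresh)[0])
    logits = np.full(W.shape[0], -np.inf)
    logits[keep] = W[keep] @ h                          # sparse GEMV on survivors
    return logits, keep.mean()

# Demo with a synthetic, genuinely clustered vocabulary embedding matrix.
V, d, C = 32000, 256, 64
rng = np.random.default_rng(0)
labels = rng.integers(0, C, size=V)             # stand-in for k-means labels
W = (rng.standard_normal((C, d))[labels]        # cluster centers ...
     + 0.05 * rng.standard_normal((V, d))).astype(np.float32)  # ... plus noise
h = rng.standard_normal(d).astype(np.float32)
centroids, radii = cluster_stats(W, labels, C)
logits, frac = certified_topk_logits(h, W, labels, centroids, radii)
print(f"exact logits computed for {frac:.1%} of the vocabulary")
```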
Related papers
- FOCUS: DLLMs Know How to Tame Their Compute Bound [10.298643186738799]
FOCUS is an inference system designed for Diffusion Large Language Models (DLLMs). It focuses computation on decodable tokens and evicts non-decodable ones on-the-fly. It achieves up to $3.52\times$ throughput improvement over the production-grade engine LMM.
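As a rough illustration of the decodable-versus-evicted split described above (the confidence scoring, threshold, and eviction policy here are invented for illustration and are not the FOCUS system):

```python
import numpy as np

def focused_step(confidence, threshold=0.9):
    """Split masked positions into 'decodable now' (spend compute this step)
    and 'evicted' (deferred to a later denoising step)."""
    decodable = np.flatnonzero(confidence >= threshold)
    evicted = np.flatnonzero(confidence < threshold)
    return decodable, evicted

# Toy per-position confidences for six masked tokens.
conf = np.array([0.97, 0.42, 0.88, 0.95, 0.10, 0.93])
decode_now, deferred = focused_step(conf)
print("decode:", decode_now, "| evict for now:", deferred)
```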
arXiv Detail & Related papers (2026-01-30T18:52:06Z)
- CHEHAB RL: Learning to Optimize Fully Homomorphic Encryption Computations [4.35834398077163]
Fully Homomorphic Encryption (FHE) enables computations directly on encrypted data, but its high computational cost remains a significant barrier. We propose CHEHAB RL, a novel framework that leverages deep reinforcement learning (RL) to automate FHE code optimization. Results show that our approach generates code that is $5.3\times$ faster in execution and accumulates $2.54\times$ less noise, while the compilation process itself is $27.9\times$ faster than Coyote.
arXiv Detail & Related papers (2026-01-27T08:49:09Z)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding [21.609237262034636]
Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted.
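A hedged sketch of what such a filter could look like; the two features (top-token probability, negative entropy), the weights, and the threshold below are invented stand-ins for the learned filter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_unmask(features, w, b, tau=0.5):
    """features: (positions, 2) per-position features; returns a boolean
    unmask decision for each parallel-decoded position."""
    return sigmoid(features @ w + b) >= tau

# Toy: 4 masked positions with (top-token probability, negative entropy).
feats = np.array([[0.98, -0.1], [0.55, -1.8], [0.91, -0.4], [0.30, -2.5]])
w, b = np.array([4.0, 0.5]), -2.0   # stand-in learned parameters
print(filter_unmask(feats, w, b))   # -> [ True False  True False]
```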
arXiv Detail & Related papers (2025-09-29T17:59:54Z)
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
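For intuition, here is a toy version of n-gram biasing inside greedy decoding. NGPU-LM's actual contribution is a batched, GPU-resident data structure; the Python dict below is only a stand-in for it:

```python
import numpy as np

def biased_greedy_step(logits, history, bigram_logp, lm_weight=0.5):
    """Add n-gram log-probabilities (here: bigrams keyed by the previous
    token) to the model's logits, then take the argmax as in greedy decoding."""
    bias = np.array([bigram_logp.get((history[-1], v), -10.0)
                     for v in range(len(logits))])
    return int(np.argmax(logits + lm_weight * bias))

bigrams = {(3, 7): -0.1, (3, 2): -2.0}          # toy context-biasing LM
logits = np.array([0.0, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.25])
print(biased_greedy_step(logits, history=[3], bigram_logp=bigrams))  # -> 7
```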
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
- Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation [11.120838175165986]
Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML). Many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code.
arXiv Detail & Related papers (2025-02-16T12:53:23Z)
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into LM generations and providing attribution to their sources. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks. In addition, NEST substantially improves generation speed, achieving a $1.8\times$ speedup in inference time when applied to Llama-2-Chat 70B.
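A simplified sketch of the propose-and-verify flow: a retrieved corpus span is offered as a draft continuation, and the longest prefix the base LM agrees with is accepted, carrying a pointer to its source. The retrieval step is elided and the strict greedy acceptance rule is our simplification of the paper's criterion:

```python
def nest_step(lm_greedy_token, context, span, source_id):
    """Accept the longest prefix of the retrieved span that the base LM
    would itself generate greedily; return tokens plus attribution."""
    accepted = []
    for tok in span:
        if lm_greedy_token(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted, (source_id if accepted else None)

# Toy LM: always continues with the next integer.
lm = lambda ctx: ctx[-1] + 1
tokens, src = nest_step(lm, context=[1, 2], span=[3, 4, 9], source_id="doc-42")
print(tokens, src)   # -> [3, 4] doc-42  (the span diverges at 9)
```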
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding method that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to $2.49\times$ speedup and maintains a minimal memory overhead of just $0.0004\%$.
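One plausible reading of the mechanism, sketched below: a few trained prompt embeddings are appended to the input so a single forward pass emits several future-token guesses. The toy model, shapes, and omitted verification step are all assumptions:

```python
import numpy as np

def parallel_prompt_decode(forward, input_embs, prompt_embs):
    """forward: maps (seq, d) embeddings -> (seq, V) logits. The last
    len(prompt_embs) rows are read as parallel guesses for future tokens."""
    logits = forward(np.vstack([input_embs, prompt_embs]))
    return logits[-len(prompt_embs):].argmax(axis=-1)  # verified in a real system

d, V, m = 8, 50, 3
rng = np.random.default_rng(0)
proj = rng.standard_normal((d, V))
forward = lambda x: x @ proj                   # toy frozen "model"
inp = rng.standard_normal((5, d))
learned_prompts = rng.standard_normal((m, d))  # the only trainable parameters
print(parallel_prompt_decode(forward, inp, learned_prompts))
```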
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
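A loose sketch of how such a draft step might look: small learned maps turn the newest hidden state into pseudo hidden states for upcoming positions, which are decoded in the same pass. The single-layer linear transfer below is our simplification, not the paper's architecture:

```python
import numpy as np

def hidden_transfer_draft(h_last, transfer_mats, lm_head):
    """h_last: (d,) hidden state of the newest token; each transfer matrix
    produces the pseudo hidden state for one additional future position."""
    drafts = []
    h = h_last
    for T in transfer_mats:
        h = T @ h                          # transfer to the next pseudo position
        drafts.append(int(np.argmax(lm_head @ h)))
    return drafts                          # draft tokens, verified afterwards

d, V = 16, 100
rng = np.random.default_rng(3)
lm_head = rng.standard_normal((V, d))
transfers = [rng.standard_normal((d, d)) * 0.2 for _ in range(3)]
print(hidden_transfer_draft(rng.standard_normal(d), transfers, lm_head))
```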
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token-reduction technique precedes each merge, keeping memory usage efficient.
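The divide-and-conquer shape is easy to sketch; the importance scores and keep ratio below are placeholders for HOMER's actual token-reduction criterion:

```python
import numpy as np

def prune(chunk, scores, keep_ratio=0.5):
    """Token reduction: keep the highest-scoring tokens, preserving order."""
    k = max(1, int(len(chunk) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-k:])
    return [chunk[i] for i in idx], scores[idx]

def homer_merge(chunks, chunk_scores):
    """Hierarchically merge adjacent chunks, pruning before each merge."""
    while len(chunks) > 1:
        next_chunks, next_scores = [], []
        for i in range(0, len(chunks) - 1, 2):
            a, sa = prune(chunks[i], chunk_scores[i])
            b, sb = prune(chunks[i + 1], chunk_scores[i + 1])
            next_chunks.append(a + b)
            next_scores.append(np.concatenate([sa, sb]))
        if len(chunks) % 2:                    # carry an odd tail chunk upward
            next_chunks.append(chunks[-1])
            next_scores.append(chunk_scores[-1])
        chunks, chunk_scores = next_chunks, next_scores
    return chunks[0]

rng = np.random.default_rng(0)
chunks = [list(range(i * 8, i * 8 + 8)) for i in range(4)]  # 4 chunks of 8 tokens
scores = [rng.random(8) for _ in range(4)]                  # stand-in importances
print(homer_merge(chunks, scores))
```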
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
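A hedged sketch of the overall pattern, with top-k compression standing in for the paper's sketching operator (the sketch is what yields the $O(k \log(d))$ per-iteration communication cost):

```python
import numpy as np

def topk_compress(g, k):
    """Keep only the k largest-magnitude entries; cheap to communicate."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def amsgrad_step(x, g, m, v, vhat, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad update: Adam moments plus a non-decreasing second moment."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    vhat = np.maximum(vhat, v)
    return x - lr * m / (np.sqrt(vhat) + eps), m, v, vhat

d, k, workers = 100, 10, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)                       # minimize f(x) = ||x||^2
m, v, vhat = np.zeros(d), np.zeros(d), np.zeros(d)
print("before:", np.linalg.norm(x))
for _ in range(200):
    grads = [topk_compress(2 * x + rng.standard_normal(d), k)
             for _ in range(workers)]            # each worker sends a sparse grad
    x, m, v, vhat = amsgrad_step(x, np.mean(grads, axis=0), m, v, vhat)
print("after: ", np.linalg.norm(x))              # norm decreases over the run
```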
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-valued feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by $0.25\times$.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)