AutoJudge: Judge Decoding Without Manual Annotation
- URL: http://arxiv.org/abs/2504.20039v1
- Date: Mon, 28 Apr 2025 17:59:28 GMT
- Title: AutoJudge: Judge Decoding Without Manual Annotation
- Authors: Roman Garipov, Fedor Velikonivtsev, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin,
- Abstract summary: AutoJudge is a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding.<n>We use a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected.<n>We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted.
- Score: 10.411318392966358
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce AutoJudge, a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the generated response, relaxing the guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected to preserve quality, and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We test our approach with Llama 3.2 1B (draft) and Llama 3.1 8B (target) models on zero-shot GSM8K reasoning, where it achieves up to 1.5x more accepted tokens per verification cycle with under 1% degradation in answer accuracy compared to standard speculative decoding and over 2x with small loss in accuracy. When applied to the LiveCodeBench benchmark, our approach automatically detects other, programming-specific important tokens and shows similar speedups, demonstrating its ability to generalize across tasks.
Related papers
- Robust Conformal Prediction with a Single Binary Certificate [58.450154976190795]
Conformal prediction (CP) converts any model's output to prediction sets with a guarantee to cover the true label with (adjustable) high probability.
We propose a robust conformal prediction that produces smaller sets even with significantly lower MC samples.
arXiv Detail & Related papers (2025-03-07T08:41:53Z) - GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a framework that incorporates a token-alignable training strategy and a token-alignable draft model.
Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z) - Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment [25.988070517700848]
Speculative decoding has been proposed as a technique to accelerate autoregressive generation.
We show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates.
We ask the following question: Can we adapt verification to recognize correct, but non-aligned replies?
arXiv Detail & Related papers (2025-01-31T17:09:53Z) - Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation [43.09801987385207]
Contrastive Language-Image Pretraining (CLIP) excels at learning generalizable image representations but often falls short in zero-shot inference on certain datasets.<n>Test-time adaptation (TTA) mitigates this issue by adjusting components like normalization layers or context prompts, yet it typically requires large batch sizes and extensive augmentations.<n>We propose Token Condensation as Adaptation (TCA), a training-free adaptation method that takes a step beyond standard TC.
arXiv Detail & Related papers (2024-10-16T07:13:35Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs)
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens [4.5888031410244885]
We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT)
The primary objective of this design is to enhance the LLM model's ability to generate draft tokens more accurately without compromising its accuracy.
Experiments conducted on the CodeLlama-13B and 7B models have yielded speed increases of over 3.5X and 3.0X, respectively.
arXiv Detail & Related papers (2024-03-27T14:54:27Z) - Block Verification Accelerates Speculative Decoding [23.764655044837113]
Speculative decoding uses a fast model to draft a block of tokens which are verified in parallel by the target model.<n>In prior works, draft verification is performed independently token-by-token.<n>We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly.
arXiv Detail & Related papers (2024-03-15T16:28:22Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech
Recognition [49.42732949233184]
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition.
Taking noisy labels as ground-truth in the loss function results in suboptimal performance.
We propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels.
arXiv Detail & Related papers (2023-08-12T12:13:52Z) - Certified Robustness to Label-Flipping Attacks via Randomized Smoothing [105.91827623768724]
Machine learning algorithms are susceptible to data poisoning attacks.
We present a unifying view of randomized smoothing over arbitrary functions.
We propose a new strategy for building classifiers that are pointwise-certifiably robust to general data poisoning attacks.
arXiv Detail & Related papers (2020-02-07T21:28:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.