Related papers: Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs

URL: http://arxiv.org/abs/2602.04099v1
Date: Wed, 04 Feb 2026 00:34:27 GMT
Title: Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs
Authors: Letian Cheng, Junyan Wang, Yan Gao, Elliott Wen, Ting Dang, Hong Jia
Abstract summary: We introduce LengthBenchmark, a system-conscious evaluation framework that integrates input length, evaluation protocol design, and system-level costs. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost. Our analysis yields two key observations: (i) sliding window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
Score: 12.220738199786007
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when irrelevant long inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs, evaluating representative LLMs under two scoring protocols (direct accumulation and fixed window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants not as a main contribution, but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
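The two scoring protocols named in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the labels: "direct accumulation" scores non-overlapping segments, and "fixed window sliding" advances a fixed-size window by a stride and scores only the tokens not scored before. The abstract does not define either protocol in detail, so the function names, the `logprob_fn` signature, and the toy scorer below are all hypothetical and are not taken from LengthBenchmark.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer signature: log-probability of `token` given `context`.
LogProbFn = Callable[[Sequence[int], int], float]


def perplexity_direct(tokens: Sequence[int], logprob_fn: LogProbFn,
                      max_context: int) -> float:
    """Score non-overlapping segments; each token sees only its own segment."""
    if not tokens or max_context < 1:
        raise ValueError("need at least one token and max_context >= 1")
    total = 0.0
    for start in range(0, len(tokens), max_context):
        segment = tokens[start:start + max_context]
        for i, tok in enumerate(segment):
            total += logprob_fn(segment[:i], tok)
    return math.exp(-total / len(tokens))


def perplexity_sliding(tokens: Sequence[int], logprob_fn: LogProbFn,
                       max_context: int, stride: int) -> float:
    """Slide a window of max_context tokens forward by `stride`.

    Each token is scored exactly once, in the first window that contains it,
    so tokens after the first window keep at least
    max_context - stride tokens of left context.
    """
    if not tokens or not (1 <= stride <= max_context):
        raise ValueError("need tokens and 1 <= stride <= max_context")
    total = 0.0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_context, len(tokens))
        for pos in range(prev_end, end):
            total += logprob_fn(tokens[begin:pos], tokens[pos])
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(-total / len(tokens))


def toy_logprob(context: Sequence[int], token: int) -> float:
    """Toy stand-in for a model (an assumption, not a real LLM):
    confidence rises with context length and saturates at 8 tokens."""
    return math.log(0.1 + 0.8 * min(len(context), 8) / 8)


tokens = list(range(64))
ppl_direct = perplexity_direct(tokens, toy_logprob, max_context=16)
ppl_sliding = perplexity_sliding(tokens, toy_logprob, max_context=16, stride=4)
```

With this toy scorer, the sliding protocol reports a lower perplexity than the direct one on the same tokens, because fewer tokens are scored with little or no context. Setting `stride` equal to `max_context` makes the two functions coincide, which is one way to see that the gap comes from the protocol rather than from the model.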

Related papers

Predicting LLM Output Length via Entropy-Guided Representations [13.351384070796747]
We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction.
arXiv Detail & Related papers (2026-02-12T10:49:04Z)
Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics [25.374192139098284]
We study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias, spanning Animals, Objects, and Demography images. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs. We propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking.
arXiv Detail & Related papers (2026-01-08T13:49:14Z)
Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
Bayesian Evaluation of Large Language Model Behavior [11.847752638476257]
It is increasingly important to evaluate how text generation systems based on large language models behave. Existing approaches to evaluation often neglect statistical uncertainty quantification. We present two case studies applying a Bayesian approach for quantifying uncertainty in binary evaluation metrics.
arXiv Detail & Related papers (2025-11-04T19:51:46Z)
Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach [118.75896764188424]
We present a novel perspective to expose the inherent size sensitivity of existing widely used Salient Object Detection metrics. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. We further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes.
arXiv Detail & Related papers (2025-09-19T04:12:14Z)
Explaining Length Bias in LLM-Based Preference Evaluations [52.141933285905885]
We decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. We show that response length impacts evaluations by influencing information mass. We propose AdapAlpaca, a simple yet effective adjustment to win rate measurement.
arXiv Detail & Related papers (2024-07-01T08:37:41Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific relevance.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle. While an approximation to the ground oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
The Counterfactual-Shapley Value: Attributing Change in System Metrics [10.804568364995982]
A key component of an attribution question is estimating counterfactual: the (hypothetical) change in the system metric due to a specified change in a single input. We propose a method to estimate counterfactuals using time-series predictive models and construct an attribution score, CF-Shapley. As a real-world application, we analyze a query-ad matching system with the goal of attributing observed change in a metric for ad matching density.
arXiv Detail & Related papers (2022-08-17T16:48:20Z)
Unveiling Project-Specific Bias in Neural Code Models [20.131797671630963]
Large Language Models (LLMs) based neural code models often struggle to generalize effectively to real-world inter-project out-of-distribution (OOD) data. We show that this phenomenon is caused by the heavy reliance on project-specific shortcuts for prediction instead of ground-truth evidence. We propose a novel bias mitigation mechanism that regularizes the model's learning behavior by leveraging latent logic relations among samples.
arXiv Detail & Related papers (2022-01-19T02:09:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.