Log Probability Tracking of LLM APIs
- URL: http://arxiv.org/abs/2512.03816v1
- Date: Wed, 03 Dec 2025 14:03:43 GMT
- Title: Log Probability Tracking of LLM APIs
- Authors: Timothée Chauvin, Erwan Le Merrer, François Taïani, Gilles Tredan
- Abstract summary: Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. We show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
- Score: 4.58249696848172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
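The abstract describes a simple statistical test on the average value of each token logprob, requesting only a single token of output. A minimal sketch of that idea follows; the Welch statistic, the threshold value, and the function names are illustrative assumptions, not the paper's exact procedure:

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / na + vb / nb)

def model_changed(baseline_logprobs, current_logprobs, threshold=4.0):
    """Flag a possible model update when the mean top-token logprob for a
    fixed prompt shifts beyond the noise observed in the baseline samples.
    Each sample is the logprob of the first output token for one request."""
    return abs(welch_t(baseline_logprobs, current_logprobs)) > threshold
```

In practice the baseline samples would be collected once per prompt, and the test re-run periodically against fresh single-token requests, which is what keeps the monitoring cost low.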
Related papers
- Visualizing token importance for black-box language models [48.747801442240565]
We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings. We propose Distribution-Based Sensitivity Analysis (DBSA) to evaluate the sensitivity of the output of a language model for each input token.
arXiv Detail & Related papers (2025-12-12T14:01:43Z) - Statistical Independence Aware Caching for LLM Workflows [3.700239041804401]
Local caching of responses offers a practical solution to reduce the cost and latency of large language model (LLM) inference. Existing LLM caching systems lack ways to enforce statistical independence constraints. We introduce Mnimi, a cache design pattern that supports modular LLM workflows while ensuring statistical integrity at the component level.
arXiv Detail & Related papers (2025-11-27T05:16:28Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate [0.19676943624884313]
Hallucinations in Large Language Model (LLM) outputs for Question Answering tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access. Our approach derives uncertainty indicators directly from the readily available log-probabilities generated during non-greedy decoding.
arXiv Detail & Related papers (2025-09-01T13:34:21Z) - Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test [24.393978712663618]
API providers may discreetly serve quantized or fine-tuned variants to reduce costs or maliciously alter model behaviors. We propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution.
arXiv Detail & Related papers (2025-06-08T03:00:31Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem. Users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach [8.646131951484696]
AuditLLM is a novel tool designed to audit the performance of various Large Language Models (LLMs) in a methodical way.
A robust, reliable, and consistent LLM is expected to generate semantically similar responses to variably phrased versions of the same question.
A certain level of inconsistency has been shown to be an indicator of potential bias, hallucinations, and other issues.
arXiv Detail & Related papers (2024-02-14T17:31:04Z) - (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs [8.403074015356594]
Large Language Models (LLMs) are increasingly integrated into software applications.
LLMs are often silently updated or scheduled for deprecation.
This can cause performance regression and affect prompt design choices.
arXiv Detail & Related papers (2023-11-18T17:11:12Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
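The Adaptive-Consistency entry above describes dynamically adjusting the number of samples per question instead of always drawing a fixed Self-Consistency budget. A rough sketch of that sampling loop follows; the majority-fraction stopping rule and its parameters are simplified stand-ins for the paper's actual stopping criterion:

```python
from collections import Counter

def adaptive_consistency(sample_fn, max_samples=40, confidence=0.95):
    """Draw answers one at a time and stop early once the leading answer
    is sufficiently dominant, instead of always drawing max_samples.
    sample_fn() returns one model answer per call."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_fn()] += 1
        top_answer, top_count = counts.most_common(1)[0]
        # Stop once a clear majority emerges (after a minimum of 3 draws).
        if n >= 3 and top_count / n >= confidence:
            return top_answer, n
    return counts.most_common(1)[0][0], max_samples
```

When the model is confident and its answers agree, the loop terminates after a handful of draws, which is where the reported sample-budget savings come from.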
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information above) and is not responsible for any consequences of its use.