Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
- URL: http://arxiv.org/abs/2508.00912v1
- Date: Tue, 29 Jul 2025 19:50:55 GMT
- Title: Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
- Authors: Ziyao Wang, Guoheng Sun, Yexiao He, Zheyu Shen, Bowei Tian, Ang Li,
- Abstract summary: Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token.<n> PALACE estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces.<n>Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy.
- Score: 7.928002407828304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signature) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
Related papers
- IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation [49.796717294455796]
We present IMMACULATE, a practical auditing framework that detects economically motivated deviations.<n>IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead.
arXiv Detail & Related papers (2026-02-26T07:21:02Z) - Visualizing token importance for black-box language models [48.747801442240565]
We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings.<n>We propose Distribution-Based Sensitivity Analysis (DBSA) to evaluate the sensitivity of the output of a language model for each input token.
arXiv Detail & Related papers (2025-12-12T14:01:43Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called textbfSaliency-textbfCoverage textbfOriented token textbfPruning for textbfEfficient MLLMs.
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency.<n>We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens.<n>We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z) - Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate [0.19676943624884313]
Hallucinations in Large Language Model (LLM) outputs for Question Answering tasks critically undermine their real-world reliability.<n>This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access.<n>Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding.
arXiv Detail & Related papers (2025-09-01T13:34:21Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Causal Prompting for Implicit Sentiment Analysis with Large Language Models [21.39152516811571]
Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated.<n>Recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA.<n>We propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning.
arXiv Detail & Related papers (2025-07-01T03:01:09Z) - IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [70.2753541780788]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding.<n>IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
arXiv Detail & Related papers (2025-06-16T08:28:19Z) - Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services [22.700907666937177]
This position paper highlights emerging accountability challenges in commercial Opaque LLM Services (COLS)<n>We formalize two key risks: textitquantity inflation, where token and call counts may be artificially inflated, and textitquality downgrade, where providers might quietly substitute lower-cost models or tools.<n>We propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals.
arXiv Detail & Related papers (2025-05-24T02:26:49Z) - CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs [13.31195673556853]
We propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens.<n>Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%.
arXiv Detail & Related papers (2025-05-19T23:39:23Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge.<n>Users pay for services based on advertised model capabilities.<n> providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs.<n>This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - Learning on LLM Output Signatures for gray-box Behavior Analysis [52.81120759532526]
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited.<n>We develop a transformer-based approach to process contamination and data detection in gray-box settings.<n>Our approach achieves superior performance on hallucination and data detection in gray-box settings, significantly outperforming existing baselines.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation.<n>We use our method to analyze LLM responses on 7 different tasks across 4 domains.<n>We find many examples of forking tokens, including surprising ones such as punctuation marks.
arXiv Detail & Related papers (2024-12-10T22:57:57Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.<n>We present a novel framework for identifying these tokens through rollout sampling.<n>We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.