Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
- URL: http://arxiv.org/abs/2508.00912v1
- Date: Tue, 29 Jul 2025 19:50:55 GMT
- Title: Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
- Authors: Ziyao Wang, Guoheng Sun, Yexiao He, Zheyu Shen, Bowei Tian, Ang Li,
- Abstract summary: Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token.<n> PALACE estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces.<n>Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy.
- Score: 7.928002407828304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signature) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Causal Prompting for Implicit Sentiment Analysis with Large Language Models [21.39152516811571]
Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated.<n>Recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA.<n>We propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning.
arXiv Detail & Related papers (2025-07-01T03:01:09Z) - IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [70.2753541780788]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding.<n>IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
arXiv Detail & Related papers (2025-06-16T08:28:19Z) - Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services [22.700907666937177]
This position paper highlights emerging accountability challenges in commercial Opaque LLM Services (COLS)<n>We formalize two key risks: textitquantity inflation, where token and call counts may be artificially inflated, and textitquality downgrade, where providers might quietly substitute lower-cost models or tools.<n>We propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals.
arXiv Detail & Related papers (2025-05-24T02:26:49Z) - CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs [13.31195673556853]
We propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens.<n>Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%.
arXiv Detail & Related papers (2025-05-19T23:39:23Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge.<n>Users pay for services based on advertised model capabilities.<n> providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs.<n>This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - Learning on LLM Output Signatures for gray-box Behavior Analysis [52.81120759532526]
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited.<n>We develop a transformer-based approach to process contamination and data detection in gray-box settings.<n>Our approach achieves superior performance on hallucination and data detection in gray-box settings, significantly outperforming existing baselines.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation.<n>We use our method to analyze LLM responses on 7 different tasks across 4 domains.<n>We find many examples of forking tokens, including surprising ones such as punctuation marks.
arXiv Detail & Related papers (2024-12-10T22:57:57Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.<n>We present a novel framework for identifying these tokens through rollout sampling.<n>We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.