Bounded PCTL Model Checking of Large Language Model Outputs
- URL: http://arxiv.org/abs/2509.18836v1
- Date: Tue, 23 Sep 2025 09:19:37 GMT
- Title: Bounded PCTL Model Checking of Large Language Model Outputs
- Authors: Dennis Gross, Helge Spieker, Arnaud Gotlieb,
- Abstract summary: We show that only a limited number of tokens are typically chosen during text generation, which are not always the same.<n>This insight drives the creation of $alpha$-$k$-bounded text generation.<n>Our verification method considers an initial string and the subsequent top-$k$ tokens.
- Score: 4.570003973862485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $\alpha$-$k$-bounded text generation, narrowing the focus to the $\alpha$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $\alpha$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $\alpha$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.
Related papers
- Decoding-Free Sampling Strategies for LLM Marginalization [15.214953630908477]
Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage.<n>We investigate sampling strategies that are decoding-free, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic.
arXiv Detail & Related papers (2025-10-23T04:50:14Z) - RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Large language models (LLMs) are crucial for preventing misuse and building trustworthy AI systems.<n>We propose RepreGuard, an efficient statistics-based detection method.<n> Experimental results show that RepreGuard outperforms all baselines with average 94.92% AUROC on both in-distribution (ID) and OOD scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP)<n>We present an inference-time method to convert any autore LM with a BPE tokenizer into a character-level or byte-level LM.<n>Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - GEM: Empowering LLM for both Embedding Generation and Language Understanding [11.081595808236239]
We propose Generative Embedding large language Model (GEM) to generate high-quality text embeddings.<n>Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask.<n>Our results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
arXiv Detail & Related papers (2025-06-04T18:02:07Z) - Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models [92.92512796044471]
We propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs)<n>We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension"<n>We introduce a novel unsupervised method, termed LLACA, which enables the construction of a dynamic $n$-gram model that adjusts based on contextual information.
arXiv Detail & Related papers (2025-05-26T07:48:15Z) - Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities [13.657259851747126]
Verifying provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc.<n>This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content.<n>In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not?<n>We model LLM-generated text as a sequential process with complete dependence on history. We then design zero-shot statistical tests to distinguish between text generated by two different known sets of LLM
arXiv Detail & Related papers (2025-01-04T23:51:43Z) - M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models [12.96619003056978]
This paper presents a novel multi-prompt ensemble decoding approach designed to bolster the generation quality of Large Language Models.<n>Given a unique input $X$, we submit $n$ variations of prompts with $X$ to LLMs in batch mode to decode and derive probability distributions.<n>For each token prediction, we calculate the ensemble probability by averaging the $n$ probability distributions within the batch, utilizing this aggregated probability to generate the token.
arXiv Detail & Related papers (2024-12-24T09:06:58Z) - Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs [3.631341123338476]
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step.<n>We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor.
arXiv Detail & Related papers (2024-07-01T08:37:25Z) - Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple yet effective black-box zero-shot detection approach based on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts.<n> Experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods.
arXiv Detail & Related papers (2024-05-07T12:57:01Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs)
We then propose textbfSequence textbfX (Check) textbfGPT, a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets.
Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics.
We propose to extract deep intrinsic characteristics of the black-box model generated texts.
arXiv Detail & Related papers (2023-05-21T17:26:16Z) - DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability
Curvature [143.5381108333212]
We show that text sampled from an large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.