Related papers: Bounded PCTL Model Checking of Large Language Model Outputs

URL: http://arxiv.org/abs/2509.18836v1
Date: Tue, 23 Sep 2025 09:19:37 GMT
Title: Bounded PCTL Model Checking of Large Language Model Outputs
Authors: Dennis Gross, Helge Spieker, Arnaud Gotlieb
Abstract summary: We show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $\alpha$-$k$-bounded text generation. Our verification method considers an initial string and the subsequent top-$k$ tokens.
Score: 4.570003973862485
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $\alpha$-$k$-bounded text generation, narrowing the focus to the $\alpha$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $\alpha$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $\alpha$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.
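
The $\alpha$-$k$ bounding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' LLMCHECKER implementation: the function names `alpha_k_bound` and `renormalize`, the toy distribution, and the renormalization step are hypothetical, and the reading of $\alpha$ as "keep the shortest prefix of the top-$k$ tokens whose cumulative probability meets or exceeds $\alpha$" is an assumption drawn from the abstract's wording.

```python
def alpha_k_bound(probs, k, alpha):
    """Keep the shortest prefix of the top-k tokens whose cumulative
    probability meets or exceeds alpha (all k tokens if it never does).

    Assumption: this is one plausible reading of the abstract, not the
    paper's exact definition.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort by descending probability; break ties by token for determinism.
    top_k = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    kept, cumulative = [], 0.0
    for token, p in top_k:
        kept.append((token, p))
        cumulative += p
        if cumulative >= alpha:
            break
    return kept


def renormalize(kept):
    """Rescale the kept probabilities so they sum to 1 (an assumed step,
    so the bounded choice is again a probability distribution)."""
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution after some initial string.
next_token_probs = {"cat": 0.5, "dog": 0.25, "bird": 0.125, "fish": 0.125}

kept = alpha_k_bound(next_token_probs, k=3, alpha=0.7)
bounded = renormalize(kept)
print(kept)
print(bounded)
```

Applying such a step at every generation position yields a finite branching structure over strings, which is the kind of object a probabilistic model checker can then analyze against PCTL properties.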

Related papers

Decoding-Free Sampling Strategies for LLM Marginalization [15.214953630908477]
Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. We investigate sampling strategies that are decoding-free, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic.
arXiv Detail & Related papers (2025-10-23T04:50:14Z)
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Detecting text generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. We propose RepreGuard, an efficient statistics-based detection method. Experimental results show that RepreGuard outperforms all baselines, with an average AUROC of 94.92% on both in-distribution (ID) and OOD scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z)
Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
GEM: Empowering LLM for both Embedding Generation and Language Understanding [11.081595808236239]
We propose Generative Embedding large language Model (GEM) to generate high-quality text embeddings. Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask. Our results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
arXiv Detail & Related papers (2025-06-04T18:02:07Z)
Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models [92.92512796044471]
We propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs). We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". We introduce a novel unsupervised method, termed LLACA, which enables the construction of a dynamic $n$-gram model that adjusts based on contextual information.
arXiv Detail & Related papers (2025-05-26T07:48:15Z)
Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities [13.657259851747126]
Verifying provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential process with complete dependence on history. We then design zero-shot statistical tests to distinguish between text generated by two different known sets of LLM
arXiv Detail & Related papers (2025-01-04T23:51:43Z)
M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models [12.96619003056978]
This paper presents a novel multi-prompt ensemble decoding approach designed to bolster the generation quality of Large Language Models. Given a unique input $X$, we submit $n$ variations of prompts with $X$ to LLMs in batch mode to decode and derive probability distributions. For each token prediction, we calculate the ensemble probability by averaging the $n$ probability distributions within the batch, utilizing this aggregated probability to generate the token.
arXiv Detail & Related papers (2024-12-24T09:06:58Z)
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs [3.631341123338476]
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor.
arXiv Detail & Related papers (2024-07-01T08:37:25Z)
Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple yet effective black-box zero-shot detection approach based on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. Experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods.
arXiv Detail & Related papers (2024-05-07T12:57:01Z)
SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs). We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z)
DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics. We propose to extract deep intrinsic characteristics of the black-box model generated texts.
arXiv Detail & Related papers (2023-05-21T17:26:16Z)
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function. We then define a new curvature-based criterion for judging if a passage is generated from a given LLM. We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.