Verifying Memoryless Sequential Decision-making of Large Language Models
- URL: http://arxiv.org/abs/2510.06756v1
- Date: Wed, 08 Oct 2025 08:31:48 GMT
- Title: Verifying Memoryless Sequential Decision-making of Large Language Models
- Authors: Dennis Gross, Helge Spieker, Arnaud Gotlieb
- Abstract summary: We introduce a tool for rigorous and automated verification of large language model (LLM)-based policies in sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property.
- Score: 4.570003973862485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a tool for rigorous and automated verification of large language model (LLM)-based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP, guided by the LLM's chosen actions. Each state is encoded as a natural language prompt, the LLM's response is parsed into an action, and the successor states reachable under the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid-world benchmarks, we show that open-source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.
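To make the loop concrete, here is a minimal sketch of the expand-and-check procedure the abstract describes. The toy MDP, the prompt wording, the model name, and the use of plain value iteration in place of Storm are all illustrative assumptions; only the deterministically seeded Ollama calls reflect the integration the abstract itself mentions.

```python
"""Sketch of verifying a memoryless LLM policy by expanding only the
reachable part of an MDP; all structures here are toy stand-ins."""
import re
import ollama  # pip install ollama; assumes a local Ollama server

# Toy MDP: state -> action -> [(successor, probability)].
# "goal" and "bad" are terminal; "bad" is the unsafe state.
MDP = {
    "s0": {"left": [("bad", 1.0)], "right": [("s1", 1.0)]},
    "s1": {"left": [("s0", 1.0)], "right": [("goal", 0.9), ("bad", 0.1)]},
    "goal": {},
    "bad": {},
}

def prompt_for(state):
    """Encode a state as a natural-language prompt (wording is invented)."""
    return (f"You are in state {state}. "
            f"Available actions: {', '.join(MDP[state])}. "
            f"Reply with exactly one action.")

def llm_action(state):
    """Query the deterministically seeded LLM and parse its reply."""
    reply = ollama.generate(model="llama3", prompt=prompt_for(state),
                            options={"seed": 0, "temperature": 0.0})["response"]
    for action in MDP[state]:
        if re.search(rf"\b{action}\b", reply):
            return action
    return next(iter(MDP[state]))  # fallback if the reply is unparsable

# Expand only the portion of the MDP the LLM policy actually reaches.
chain, worklist = {}, ["s0"]
while worklist:
    s = worklist.pop()
    if s in chain or not MDP[s]:
        continue  # already expanded, or terminal
    chain[s] = MDP[s][llm_action(s)]
    worklist += [t for t, _ in chain[s]]

# The real tool hands the induced chain to Storm; simple value iteration
# stands in here, computing P(reach "bad") under the LLM policy.
prob = {s: 1.0 if s == "bad" else 0.0 for s in MDP}
for _ in range(100):
    for s, succs in chain.items():
        prob[s] = sum(p * prob[t] for t, p in succs)
print(f"P(reach bad from s0) = {prob['s0']:.3f}")
```

The induced model is a Markov chain rather than an MDP because the memoryless policy fixes one action per state, which is what makes checking a PCTL safety bound tractable.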
Related papers
- A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation [2.5705703401045548]
We introduce an agentic large language model (LLM)-based framework for automated MDP modeling and policy generation (A-LAMP). A-LAMP translates free-form natural language task descriptions into an MDP formulation and a trained policy. A-LAMP consistently achieves higher policy-generation capability than a single state-of-the-art model.
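A rough sketch of the natural-language-to-MDP step, assuming a JSON output format and a stubbed LLM call; A-LAMP's actual agentic pipeline (validation, repair, policy training) is not reproduced here.

```python
"""Hypothetical sketch: ask an LLM to formalize a task as an MDP in JSON."""
import json

PROMPT_TEMPLATE = (
    "Formalize this task as an MDP. Reply with JSON having keys "
    "'states', 'actions', 'transitions', 'rewards'.\nTask: {task}"
)

def call_llm(prompt: str) -> str:
    """Stub standing in for the LLM; returns a canned formalization."""
    return json.dumps({
        "states": ["start", "goal"],
        "actions": ["move"],
        "transitions": {"start": {"move": {"goal": 1.0}}},
        "rewards": {"goal": 1.0},
    })

task = "Move the robot from the start cell to the goal cell."
mdp = json.loads(call_llm(PROMPT_TEMPLATE.format(task=task)))
assert set(mdp) == {"states", "actions", "transitions", "rewards"}
print(mdp["transitions"])
```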
arXiv Detail & Related papers (2025-12-12T04:21:17Z)
- Towards Harnessing the Power of LLMs for ABAC Policy Mining [0.0468771281852187]
This paper presents an empirical investigation into the capabilities of Large Language Models (LLMs) to perform automated Attribute-based Access Control (ABAC) policy mining. We evaluate the performance of some of the state-of-the-art LLMs, specifically Google Gemini (Flash and Pro) and OpenAI ChatGPT, as potential policy mining engines.
arXiv Detail & Related papers (2025-11-22T15:49:36Z)
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
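As a rough illustration of the penalty-factor mechanism, the sketch below uses the standard weighted-Lasso rescaling trick; the penalty factors are hard-coded stand-ins for LLM-generated scores, and LLM-Lasso's tunable conversion model is not reproduced.

```python
"""Weighted Lasso via column rescaling (penalty factors are stand-ins)."""
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Hypothetical LLM-generated penalty factors: lower = judged more relevant.
w = np.array([0.2, 0.2, 1.0, 1.0])

# Weighted Lasso, min ||y - Xb||^2 + alpha * sum_j w_j |b_j|, reduces to a
# plain Lasso after rescaling column j by 1 / w_j.
model = Lasso(alpha=0.1).fit(X / w, y)
coef = model.coef_ / w  # map coefficients back to the original features
print(coef)  # features with low penalty factors are more likely retained
```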
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
- Embodied CoT Distillation From LLM To Off-the-shelf Agents [6.318203525449058]
DeDer is a framework for decomposing and distilling the embodied reasoning capabilities from large language models (LLMs). Our experiments with the ALFRED benchmark demonstrate that DeDer surpasses leading language planning and distillation approaches.
arXiv Detail & Related papers (2024-12-16T07:18:02Z)
- RuAG: Learned-rule-augmented Generation for Large Language Models [62.64389390179651]
We propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules.
We evaluate our framework on public and private industrial tasks, including natural language processing, time-series analysis, and decision-making.
arXiv Detail & Related papers (2024-11-04T00:01:34Z)
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation [57.07295906718989]
Constrained decoding approaches aim to control the meaning or style of text generated by pre-trained large language models (LLMs, also PLMs) for various tasks at inference time. These methods often guide plausible continuations by greedily and explicitly selecting targets. Inspired by cognitive dual-process theory, we propose a novel decoding framework, DECIDER.
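For context, here is a generic sketch of rule-controllable decoding, where a rule scorer biases the next-token distribution at each step. It illustrates the constrained-decoding setting rather than DECIDER's dual-system method, and the toy vocabulary and scoring functions are invented for illustration.

```python
"""Toy rule-biased greedy decoding (not DECIDER's actual algorithm)."""
VOCAB = ["the", "cat", "dog", "sat", "ran", "."]

def lm_logits(prefix):
    """Toy stand-in for a pre-trained LM's next-token scores."""
    scores = {w: 0.0 for w in VOCAB}
    if prefix[-1] == "the":
        scores["cat"], scores["dog"] = 1.0, 0.8
    elif prefix[-1] in ("cat", "dog"):
        scores["sat"], scores["ran"] = 1.0, 0.9
    else:
        scores["."] = 1.0
    return scores

def rule_bias(prefix, word):
    """A hard-coded constraint: the sentence should mention 'dog'."""
    return 2.0 if word == "dog" and "dog" not in prefix else 0.0

def decode(steps=3):
    prefix = ["the"]
    for _ in range(steps):
        # Soft guidance: add the rule score to the LM score, then take the
        # argmax, instead of hard-filtering the vocabulary.
        scored = {w: s + rule_bias(prefix, w)
                  for w, s in lm_logits(prefix).items()}
        prefix.append(max(scored, key=scored.get))
    return " ".join(prefix)

print(decode())  # -> "the dog sat ."
```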
arXiv Detail & Related papers (2024-03-04T11:49:08Z)
- RoCar: A Relationship Network-based Evaluation Method for Large Language Models [20.954826722195847]
How to rationally evaluate the capabilities of large language models (LLMs) remains an open problem.
We propose the RoCar method, which uses a set of defined basic schemas to randomly construct a task graph.
Because the task graph is randomly constructed, none of the LLMs under test can have directly learned the evaluation tasks, which guarantees the fairness of the evaluation.
arXiv Detail & Related papers (2023-07-29T14:47:07Z)
- Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
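A minimal sketch of the idea: a small policy model produces a hint that is spliced into the black-box LLM's prompt. The keyword-extraction stub and prompt template below are hypothetical stand-ins; the paper trains the policy model rather than hand-coding it.

```python
"""Hypothetical sketch of directional stimulus prompting."""

def policy_hint(article: str) -> str:
    """Stand-in for the small tunable policy model: crude keyword hints."""
    words = [w.strip(".,") for w in article.split()]
    keywords = sorted(set(w for w in words if len(w) > 7))[:3]
    return "; ".join(keywords)

def build_prompt(article: str) -> str:
    """Splice the stimulus into the prompt sent to the black-box LLM."""
    return (f"Article: {article}\n"
            f"Hint (keywords to cover): {policy_hint(article)}\n"
            f"Summarize the article in one sentence.")

article = ("Researchers introduced a verification toolchain that checks "
           "language-model policies against probabilistic temporal logic "
           "specifications using the Storm model checker.")
print(build_prompt(article))  # this prompt would go to the black-box LLM
```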
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between a limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) depend only on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
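The product construction at the heart of the EP-MDP can be sketched as follows; the toy MDP, labeling, and deterministic automaton below are illustrative stand-ins, and the LDGBA's tracking-frontier machinery is omitted.

```python
"""Toy product of an MDP and an omega-automaton (EP-MDP details omitted)."""

# Toy MDP: state -> action -> [(successor, prob)]; each state has a label.
MDP = {"a": {"go": [("b", 1.0)]}, "b": {"go": [("a", 0.5), ("b", 0.5)]}}
LABEL = {"a": "safe", "b": "goal"}

# Toy deterministic automaton over labels: q -> label -> q'.
AUTO = {"q0": {"safe": "q0", "goal": "q1"},
        "q1": {"safe": "q0", "goal": "q1"}}
ACCEPTING = {"q1"}

def product_step(prod_state, action):
    """One transition of the product: MDP moves, automaton reads the label."""
    s, q = prod_state
    return [((t, AUTO[q][LABEL[t]]), p) for t, p in MDP[s][action]]

def shaped_reward(prod_state):
    """Reward shaping defined purely on product states, as in the paper."""
    _, q = prod_state
    return 1.0 if q in ACCEPTING else 0.0

print(product_step(("a", "q0"), "go"))  # [(('b', 'q1'), 1.0)]
print(shaped_reward(("b", "q1")))       # 1.0
```

Because the shaped reward is a function of product states alone, a model-free RL agent can be trained directly on the product without access to the underlying transition probabilities.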
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.