Related papers: FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

URL: http://arxiv.org/abs/2512.18940v1
Date: Mon, 22 Dec 2025 01:19:50 GMT
Title: FASTRIC: Prompt Specification Language for Verifiable LLM Interactions
Authors: Wen-Long Jin,
Abstract summary: Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent.<n>We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts.
Score: 3.8073142980732997
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure-simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality-ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models-serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges-"Goldilocks zones"-where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.

Related papers

Evaluating LLM-Generated ACSL Annotations for Formal Verification [0.0]
This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance.<n>Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models.<n>All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers.
arXiv Detail & Related papers (2026-02-14T19:18:34Z)
Bridging Natural Language and Formal Specification--Automated Translation of Software Requirements to LTL via Hierarchical Semantics Decomposition Using LLMs [10.958536923155101]
Req2LTL is a modular framework that bridges NL and Linear Temporal Logic.<n>It achieves 88.4% semantic accuracy and 100% syntactic correctness on real-world aerospace requirements.
arXiv Detail & Related papers (2025-12-19T08:25:54Z)
Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation [8.398818816613806]
We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration.<n>Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters.
arXiv Detail & Related papers (2025-12-11T10:22:02Z)
Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages [61.18573330164572]
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time.<n>This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior.
arXiv Detail & Related papers (2025-12-02T14:54:54Z)
Synthesizing Precise Protocol Specs from Natural Language for Effective Test Generation [42.582977261473324]
AutoSPEC recovers on average 92.8% of client and 80.2% of server message types, and 81.5% message acceptance across diverse real-world systems.<n>Prototype successfully demonstrated the feasibility of our approach on five widely used.<n>internet-based protocols.
arXiv Detail & Related papers (2025-11-22T08:39:52Z)
PAT-Agent: Autoformalization for Model Checking [17.082027022913998]
PAT-Agent is an end-to-end framework for natural language autoformalization and formal model repair.<n>It combines the generative capabilities of large language models with the rigor of formal verification.
arXiv Detail & Related papers (2025-09-28T06:32:14Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
Self-Steering Language Models [113.96916935955842]
DisCIPL is a method for "self-steering" language models (LMs)<n>DisCIPL generates a task-specific inference program that is executed by a population of Follower models.<n>Our work opens up a design space of highly-parallelized Monte Carlo inference strategies.
arXiv Detail & Related papers (2025-04-09T17:54:22Z)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of language models (LLMs) to solve our task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.