Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages
- URL: http://arxiv.org/abs/2406.09757v2
- Date: Tue, 15 Oct 2024 21:44:50 GMT
- Title: Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages
- Authors: Shuvendu K. Lahiri,
- Abstract summary: We propose a metric for evaluating the quality of specifications for verification-aware languages.
We show that our metric agrees closely on a human-labeled dataset of Dafny specifications for the popular MBPP code-generation benchmark.
We also outline formal verification challenges that need to be addressed to apply the technique more widely.
- Score: 6.0608817611709735
- License:
- Abstract: Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of a program. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the {\it user-intent formalization for programs}, expressed as a formal specification. This is because intent or requirement is expressed {\it informally} in natural language and the specification is a formal artefact. Despite, the advent of large language models (LLMs) has made tremendous strides bridging the gap between informal intent and formal program implementations recently, driven in large parts by benchmarks and automated metrics for evaluation. Recent work has proposed a framework for evaluating the {\it user-intent formalization} problem for mainstream programming languages~\cite{endres-fse24}. However, such an approach does not readily extend to verification-aware languages that support rich specifications (using quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternate, perhaps simpler approach of {\it symbolically testing specifications} to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely on a human-labeled dataset of Dafny specifications for the popular MBPP code-generation benchmark, yet demonstrates cases where the human labeling is not perfect. We also outline formal verification challenges that need to be addressed to apply the technique more widely. We believe our work provides a stepping stone to enable the establishment of a benchmark and research agenda for the problem of user-intent formalization for programs.
Related papers
- An Agile Formal Specification Language Design Based on K Framework [11.042686307366818]
This paper introduces an Agile Formal Specification Language (ASL)
The design of ASL incorporates agile design principles, making the writing of formal specifications simpler, more efficient, and scalable.
A specification translation algorithm is developed, capable of converting ASL into K formal specification language that can be executed for verification.
arXiv Detail & Related papers (2024-04-29T09:00:50Z) - NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z) - An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications [5.395614997568524]
This paper presents a systematic investigation into the capacity of Large Language Models (LLMs) for repairing declarative specifications in Alloy.
We propose a novel repair pipeline that integrates a dual-agent LLM framework, comprising a Repair Agent and a Prompt Agent.
Our study reveals that LLMs, particularly GPT-4 variants, outperform existing techniques in terms of repair efficacy, albeit with a marginal increase in runtime and token usage.
arXiv Detail & Related papers (2024-04-17T03:46:38Z) - Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification [15.686651364655958]
AutoSpec is an automated approach to synthesize specifications for automated program verification.
It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof.
It can be successfully applied to verify the programs in a real-world X509-parser project.
arXiv Detail & Related papers (2024-03-31T18:15:49Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z) - Automatic Program Instrumentation for Automatic Verification (Extended
Technical Report) [0.0]
In deductive verification and software model checking, dealing with certain specification language constructs can be problematic.
We propose instrumentation as a unifying verification paradigm that subsumes various existing ad-hoc approaches.
We implement our approach in the MonoCera tool, which is tailored to the verification of programs with aggregation.
arXiv Detail & Related papers (2023-05-26T14:55:35Z) - nl2spec: Interactively Translating Unstructured Natural Language to
Temporal Logics with Large Language Models [3.1143846686797314]
We present nl2spec, a framework for applying Large Language Models (LLMs) derive formal specifications from unstructured natural language.
We introduce a new methodology to detect and resolve the inherent ambiguity of system requirements in natural language.
Users iteratively add, delete, and edit these sub-translations to amend erroneous formalizations, which is easier than manually redrafting the entire formalization.
arXiv Detail & Related papers (2023-03-08T20:08:53Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROS, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.