Trusta: Reasoning about Assurance Cases with Formal Methods and Large
Language Models
- URL: http://arxiv.org/abs/2309.12941v1
- Date: Fri, 22 Sep 2023 15:42:43 GMT
- Title: Trusta: Reasoning about Assurance Cases with Formal Methods and Large
Language Models
- Authors: Zezhong Chen, Yuxin Deng, Wenjie Du
- Abstract summary: Trustworthiness Derivation Tree Analyzer (Trusta) is a desktop application designed to automatically construct and verify TDTs.
It has a built-in Prolog interpreter in its backend, and is supported by the constraint solvers Z3 and MONA.
Trusta can extract formal constraints from text in natural languages, facilitating an easier interpretation and validation process.
- Score: 4.005483185111992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assurance cases can be used to argue for the safety of products in safety
engineering. In safety-critical areas, the construction of assurance cases is
indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases
by incorporating formal methods, making automatic reasoning about assurance cases possible. We present Trustworthiness Derivation Tree Analyzer
(Trusta), a desktop application designed to automatically construct and verify
TDTs. The tool has a built-in Prolog interpreter in its backend, and is
supported by the constraint solvers Z3 and MONA. Therefore, it can solve
constraints over logical formulas involving arithmetic, sets, Horn clauses, etc. Trusta also utilizes large language models to make the creation and
evaluation of assurance cases more convenient. It allows for interactive human
examination and modification. We evaluated top language models like
ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests
showed a 50%-80% similarity between machine-generated and human-created cases.
In addition, Trusta can extract formal constraints from text in natural
languages, facilitating an easier interpretation and validation process. This
extraction is subject to human review and correction, blending the best of
automated efficiency with human insight. To our knowledge, this marks the first
integration of large language models in automatically creating and reasoning about
assurance cases, bringing a novel approach to a traditional challenge. Through
several industrial case studies, Trusta has proven able to quickly find subtle issues that are typically missed in manual inspection, demonstrating its
practical value in enhancing the assurance case development process.
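To make the constraint-solving step concrete, here is a minimal sketch (not Trusta's actual code) of how a TDT-style arithmetic claim could be reduced to a formula and discharged with Z3's Python API; the variable names, bounds, and the claim itself are hypothetical.

```python
# Minimal sketch (not Trusta's actual code): checking a hypothetical
# TDT-style arithmetic claim with Z3's Python API (pip install z3-solver).
from z3 import Int, Solver, And, Not, Implies, unsat

response_ms = Int('response_ms')  # worst-case response time of a component
budget_ms = Int('budget_ms')      # time budget allocated by the parent goal

# Hypothetical claim: if the component responds within 80 ms and the budget
# is at least 100 ms, the response fits within the budget.
claim = Implies(And(response_ms <= 80, budget_ms >= 100), response_ms < budget_ms)

s = Solver()
s.add(Not(claim))  # search for a counterexample to the claim
if s.check() == unsat:
    print("Claim holds: no counterexample exists.")
else:
    print("Claim refuted, counterexample:", s.model())
```

Constraints over sets and similar structures would be handled by MONA rather than Z3, per the abstract; the division of labor shown here is only illustrative.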
Related papers
- Query as Test: An Intelligent Driving Test and Data Storage Method for Integrated Cockpit-Vehicle-Road Scenarios [17.75264660582999]
Existing testing methods rely on data stacking, fail to cover all edge cases, and lack flexibility.
"Query as Test" (QaT) shifts the focus from rigid, prescripted test cases to flexible, on-demand logical queries.
"Extensible Scenarios Notations" (ESN) is a novel declarative data framework.
arXiv Detail & Related papers (2025-06-27T09:59:58Z)
- Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure [1.5653612447564105]
We propose a compliance detection approach based on Natural Language Inference (NLI).
We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection.
Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
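As a hedged illustration of the NLI idea (not this paper's actual pipeline), a single evidence-to-claim hop can be scored with an off-the-shelf NLI model; the model choice and the example texts below are assumptions.

```python
# Hedged illustration (not the paper's pipeline): score one hop of a
# claim-argument-evidence chain with an off-the-shelf NLI model.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

evidence = "All unit tests of the braking module passed on the release build."
claim = "The braking module meets its verification objective."

# The text-classification pipeline accepts a premise/hypothesis pair.
print(nli({"text": evidence, "text_pair": claim}))
# e.g. [{'label': 'ENTAILMENT', 'score': ...}] or NEUTRAL / CONTRADICTION
```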
arXiv Detail & Related papers (2025-06-10T11:56:06Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.
We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
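As a rough sketch of feeding program-analysis results into a reward signal (the weighting and the AST-based checks are invented here, not REAL's design):

```python
# Rough sketch (not REAL's design): a reward that mixes functional correctness
# with a crude static-analysis penalty computed from the generated code's AST.
import ast

def static_issues(code: str) -> int:
    """Count a toy proxy for quality problems: bare excepts and eval/exec calls."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 100  # unparseable code is heavily penalized
    issues = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            issues += 1
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in {"eval", "exec"}:
            issues += 1
    return issues

def reward(code: str, tests_passed: int, tests_total: int, lam: float = 0.1) -> float:
    return tests_passed / max(tests_total, 1) - lam * static_issues(code)
```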
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming [7.726159311658054]
We introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications.
Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases.
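As a toy illustration of why counters matter (not the CCFG formalism itself), the generator below ties the number of values on the second line to a count emitted on the first line, a dependency plain context-free grammars cannot enforce; the input format is invented.

```python
# Toy illustration (not the CCFG formalism): a counter sampled for the first
# line fixes how many integers the second line may contain.
import random

def generate_case(max_n: int = 10, max_value: int = 100) -> str:
    n = random.randint(1, max_n)  # the counter
    values = [str(random.randint(1, max_value)) for _ in range(n)]
    return f"{n}\n{' '.join(values)}\n"

if __name__ == "__main__":
    print(generate_case())
```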
arXiv Detail & Related papers (2025-05-21T02:48:01Z)
- Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs.
By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process.
Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
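For intuition about reasoning over output reachable sets (this is generic interval abstract interpretation, not the paper's exact method), the sketch below propagates an input box through one affine layer plus ReLU; the weights and bounds are made up.

```python
# Generic interval abstract interpretation (not the paper's exact method):
# over-approximate a small network's output reachable set from an input box.
import numpy as np

def interval_affine(lo, hi, W, b):
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# Hypothetical 2-input, 2-output layer and input box [-1, 1]^2.
W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.0, -0.25])
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

lo, hi = interval_relu(*interval_affine(lo, hi, W, b))
print("outputs lie in", list(zip(lo, hi)))
# Nested safety levels can then be phrased as progressively tighter constraints on this box.
```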
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
- Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
Computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI.
We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts.
We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z)
- Automatic Instantiation of Assurance Cases from Patterns Using Large Language Models [6.314768437420443]
Large Language Models (LLMs) can generate assurance cases that comply with specific patterns.
LLMs exhibit potential in the automatic generation of assurance cases, but their capabilities still fall short compared to human experts.
arXiv Detail & Related papers (2024-10-07T20:58:29Z)
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Automating Semantic Analysis of System Assurance Cases using Goal-directed ASP [1.2189422792863451]
We present our approach to enhancing Assurance 2.0 with semantic rule-based analysis capabilities.
We examine the unique semantic aspects of assurance cases, such as logical consistency, adequacy, indefeasibility, etc.
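A hedged sketch of what rule-based semantic checking can look like in ASP (the predicates and the adequacy rule are invented, not the paper's rules), runnable with the clingo Python API:

```python
# Hedged sketch (not the paper's rules): flag assurance-case claims that lack
# supporting evidence, using answer set programming via the clingo Python API.
import clingo

program = """
claim(braking_safe). claim(timing_met).
supports(test_report, braking_safe).
supported(C) :- supports(_, C).
unsupported(C) :- claim(C), not supported(C).
#show unsupported/1.
"""

ctl = clingo.Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Flagged:", m))  # Flagged: unsupported(timing_met)
```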
arXiv Detail & Related papers (2024-08-21T15:22:43Z)
- Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting [6.938766764201549]
This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques.
We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
arXiv Detail & Related papers (2024-07-31T21:12:21Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs as evaluators, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- (Security) Assertions by Large Language Models [25.270188328436618]
We investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security.
We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt.
arXiv Detail & Related papers (2023-06-24T17:44:36Z)
- On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.