Trusta: Reasoning about Assurance Cases with Formal Methods and Large Language Models
- URL: http://arxiv.org/abs/2309.12941v1
- Date: Fri, 22 Sep 2023 15:42:43 GMT
- Title: Trusta: Reasoning about Assurance Cases with Formal Methods and Large Language Models
- Authors: Zezhong Chen, Yuxin Deng, Wenjie Du
- Abstract summary: Trustworthiness Derivation Tree Analyzer (Trusta) is a desktop application designed to automatically construct and verify TDTs.
It has a built-in Prolog interpreter in its backend, and is supported by the constraint solvers Z3 and MONA.
Trusta can extract formal constraints from text in natural languages, facilitating an easier interpretation and validation process.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assurance cases can be used to argue for the safety of products in safety
engineering. In safety-critical areas, the construction of assurance cases is
indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases
by incorporating formal methods, making automatic reasoning about assurance
cases possible. We present Trustworthiness Derivation Tree Analyzer
(Trusta), a desktop application designed to automatically construct and verify
TDTs. The tool has a built-in Prolog interpreter in its backend, and is
supported by the constraint solvers Z3 and MONA. It can therefore solve
constraints over logical formulas involving arithmetic, sets, Horn clauses,
and so on. Trusta also utilizes large language models to make the creation and
evaluation of assurance cases more convenient. It allows for interactive human
examination and modification. We evaluated top language models like
ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests
showed a 50%-80% similarity between machine-generated and human-created cases.
In addition, Trusta can extract formal constraints from text in natural
languages, facilitating an easier interpretation and validation process. This
extraction is subject to human review and correction, blending the best of
automated efficiency with human insight. To our knowledge, this marks the first
integration of large language models into the automatic creation of and reasoning
about assurance cases, bringing a novel approach to a traditional challenge. Through
several industrial case studies, Trusta has proven able to quickly find subtle
issues that are typically missed in manual inspection, demonstrating its
practical value in enhancing the assurance case development process.
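The abstract notes that Trusta's backend dispatches extracted constraints to solvers such as Z3. As a rough, illustrative sketch (not code from the paper), the following Python snippet uses the z3-solver package to check the kind of arithmetic claim that might be derived from a natural-language safety requirement; the braking-distance formula, the bounds, and the variable names are invented for this example.

```python
# Minimal sketch of a refutation-style check with Z3 (pip install z3-solver).
# Hypothetical claim, as might be extracted from a requirement such as:
# "for speeds up to 15 m/s, braking distance v^2/10 must stay below 25 m".
from z3 import Real, And, Implies, Not, Solver, unsat

v = Real('v')  # vehicle speed in m/s (illustrative variable)
d = Real('d')  # braking distance in m (illustrative variable)

claim = Implies(And(v >= 0, v <= 15, d == v * v / 10), d < 25)

s = Solver()
s.add(Not(claim))  # ask Z3 for a counterexample to the claim
if s.check() == unsat:
    print("Claim holds: no counterexample exists.")
else:
    print("Counterexample:", s.model())
```

Checking the negation is the usual refutation pattern: if Z3 reports unsat, no assignment violates the claim and the corresponding TDT node can be considered verified; otherwise the returned model is a concrete counterexample for human review.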
Related papers
- Automatic Instantiation of Assurance Cases from Patterns Using Large Language Models [6.314768437420443]
Large Language Models (LLMs) can generate assurance cases that comply with specific patterns.
LLMs exhibit potential in the automatic generation of assurance cases, but their capabilities still fall short compared to human experts.
arXiv Detail & Related papers (2024-10-07T20:58:29Z)
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Automating Semantic Analysis of System Assurance Cases using Goal-directed ASP [1.2189422792863451]
We present our approach to enhancing Assurance 2.0 with semantic rule-based analysis capabilities.
We examine the unique semantic aspects of assurance cases, such as logical consistency, adequacy, indefeasibility, etc.
arXiv Detail & Related papers (2024-08-21T15:22:43Z)
- Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting [6.938766764201549]
This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques.
We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
arXiv Detail & Related papers (2024-07-31T21:12:21Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and error rates of up to 19% absolute in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- (Security) Assertions by Large Language Models [25.270188328436618]
We investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security.
We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt.
arXiv Detail & Related papers (2023-06-24T17:44:36Z)
- On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)