Validating Traces of Distributed Programs Against TLA+ Specifications
- URL: http://arxiv.org/abs/2404.16075v2
- Date: Tue, 17 Sep 2024 16:52:15 GMT
- Title: Validating Traces of Distributed Programs Against TLA+ Specifications
- Authors: Horatiu Cirstea, Markus A. Kuppe, Benjamin Loillier, Stephan Merz
- Abstract summary: We present a framework for relating traces of distributed programs to high-level specifications written in TLA+.
The problem is reduced to a constrained model checking problem, realized using the TLC model checker.
We have applied our approach to several distributed programs, detecting discrepancies between the specifications and the implementations in all cases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: TLA+ is a formal language for specifying systems, including distributed algorithms, that is supported by powerful verification tools. In this work we present a framework for relating traces of distributed programs to high-level specifications written in TLA+. The problem is reduced to a constrained model checking problem, realized using the TLC model checker. Our framework consists of an API for instrumenting Java programs in order to record traces of executions, of a collection of TLA+ operators that are used for relating those traces to specifications, and of scripts for running the model checker. Crucially, traces only contain updates to specification variables rather than full values, and developers may choose to trace only certain variables. We have applied our approach to several distributed programs, detecting discrepancies between the specifications and the implementations in all cases. We discuss reasons for these discrepancies, best practices for instrumenting programs, and how to interpret the verdict produced by TLC.
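To make the instrumentation idea concrete, here is a minimal sketch of what logging updates to specification variables from a Java program could look like. The class and method names (TraceLogger, logUpdate), the JSON field names, and the file layout are assumptions made for illustration, not the API provided by the paper's framework.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical sketch of trace instrumentation in the spirit of the abstract:
// each call records an update to a single specification variable as one JSON
// line, instead of dumping the full program state.
public class TraceLogger {
    private final PrintWriter out;
    private long clock = 0; // event counter used to order trace entries

    public TraceLogger(String path) throws IOException {
        this.out = new PrintWriter(new FileWriter(path, true), true);
    }

    // Record that `variable` was updated by `operation` (e.g. "Set",
    // "AppendElement") with the given argument, already rendered as JSON.
    public synchronized void logUpdate(String variable, String operation, String jsonArg) {
        clock++;
        out.printf("{\"clock\": %d, \"var\": \"%s\", \"op\": \"%s\", \"arg\": %s}%n",
                   clock, variable, operation, jsonArg);
    }

    public void close() {
        out.close();
    }

    // Example: a node traces only the two variables it cares about.
    public static void main(String[] args) throws IOException {
        TraceLogger trace = new TraceLogger("trace-node1.ndjson");
        trace.logUpdate("log", "AppendElement", "\"value1\"");
        trace.logUpdate("state", "Set", "\"leader\"");
        trace.close();
    }
}
```

On the specification side, the framework described above uses TLA+ operators and the TLC model checker to search for a behavior of the specification that is consistent with such partial updates; if no such behavior exists, TLC reports the discrepancy between implementation and specification.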
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Large Language Models as Realistic Microservice Trace Generators [54.85489678342595]
Workload traces are essential to understand complex computer systems' behavior and manage processing and memory resources.
This paper proposes a first-of-a-kind approach that relies on training a large language model to generate synthetic workload traces.
Our model adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
arXiv Detail & Related papers (2024-12-16T12:48:04Z)
- TGOSPA Metric Parameters Selection and Evaluation for Visual Multi-object Tracking [4.998475411100799]
We use the trajectory generalized optimal sub-pattern assignment (TGOSPA) metric to evaluate multi-object tracking performance.
It accounts for localization errors, the number of missed and false objects, and the number of track switches.
By exploring the TGOSPA parameter selection, we enable users to compare, comprehend, and optimize the performance of algorithms tailored for specific tasks.
arXiv Detail & Related papers (2024-12-11T11:57:05Z)
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
- Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast [5.019791860882564]
Large language models (LLMs) have shown promise in a number of software engineering activities.
This paper explores the effectiveness of OpenAI's GPT models in generating specifications based on separation logic.
arXiv Detail & Related papers (2024-11-04T17:44:11Z)
- DiffSpec: Differential Testing with LLMs using Natural Language Specifications and Code Artifacts [6.222103009114668]
We introduce DiffSpec, a framework for generating differential tests with large language models.
We demonstrate the efficacy of DiffSpec on two different systems, namely, eBPF runtimes and Wasm validators.
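Abstracting from the LLM-driven generation of test inputs, the differential-testing loop that such a framework rests on can be sketched in a few lines. The two toy validators below are placeholders invented for illustration and have nothing to do with real eBPF or Wasm tooling.

```java
import java.util.List;

// Minimal sketch of differential testing: feed the same input to two
// implementations that should agree and flag any divergence. The two
// "validators" are stand-ins, not actual eBPF/Wasm components.
public class DiffHarness {
    static boolean validatorA(String program) { return !program.contains("bad"); }
    static boolean validatorB(String program) { return program.length() < 8; }

    public static void main(String[] args) {
        List<String> generatedInputs = List.of("good", "bad", "longprogram");
        for (String input : generatedInputs) {
            boolean a = validatorA(input);
            boolean b = validatorB(input);
            if (a != b) {
                // A disagreement is a candidate bug in one of the two systems.
                System.out.printf("divergence on %-12s A=%b B=%b%n", input, a, b);
            }
        }
    }
}
```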
arXiv Detail & Related papers (2024-10-05T18:11:14Z)
- Multi-Grained Specifications for Distributed System Model Checking and Verification [9.14614889088682]
We use TLA+ to model fine-grained behaviors of ZooKeeper and the TLC model checker to verify its correctness properties.
We show that writing multi-grained specifications is a viable practice and can cope with model-code gaps without untenable state space.
arXiv Detail & Related papers (2024-09-22T02:59:56Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z)
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from deterioration at the latter stage of the supervised fine-tuning process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
arXiv Detail & Related papers (2024-05-22T08:18:19Z)
- Requirements Traceability: Recovering and Visualizing Traceability Links Between Requirements and Source Code of Object-oriented Software Systems [0.0]
Requirement-to-Code Traceability Links (RtC-TLs) shape the relations between requirement and source code artifacts.
This paper introduces YamenTrace, an automatic approach and implementation to recover and visualize RtC-TLs.
arXiv Detail & Related papers (2023-07-09T11:01:16Z)
- Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks.
We first leverage the in-context learning ability of large language models to generate reasoning programs.
We execute the program by delegating each sub-task to the corresponding sub-task handler.
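The execution step boils down to dispatching each sub-task of the generated program to a registered handler. The sketch below illustrates that dispatch pattern only; the handler names and the sub-task encoding are invented here rather than taken from ProgramFC.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative dispatch loop: each kind of sub-task is mapped to a handler
// that resolves it. Real handlers would wrap models (e.g. question answering
// or verification); here they are stubs.
public class SubTaskExecutor {
    private final Map<String, Function<String, String>> handlers = new LinkedHashMap<>();

    public void register(String kind, Function<String, String> handler) {
        handlers.put(kind, handler);
    }

    public String execute(String kind, String argument) {
        Function<String, String> handler = handlers.get(kind);
        if (handler == null) {
            throw new IllegalArgumentException("no handler for sub-task kind: " + kind);
        }
        return handler.apply(argument);
    }

    public static void main(String[] args) {
        SubTaskExecutor executor = new SubTaskExecutor();
        executor.register("Question", q -> "stub answer to: " + q);
        executor.register("Verify", claim -> "SUPPORTED");
        System.out.println(executor.execute("Question", "Who is the subject of the claim?"));
        System.out.println(executor.execute("Verify", "The retrieved answer matches the claim."));
    }
}
```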
arXiv Detail & Related papers (2023-05-22T06:11:15Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Self-Supervised Log Parsing [59.04636530383049]
Large-scale software systems generate massive volumes of semi-structured log records.
Existing approaches rely on log-specific heuristics or manual rule extraction.
We propose NuLog that utilizes a self-supervised learning model and formulates the parsing task as masked language modeling.
arXiv Detail & Related papers (2020-03-17T19:25:25Z)
- Learning to Encode and Classify Test Executions [14.67675979776677]
The goal in this paper is to solve the test oracle problem in a way that is general, scalable and accurate.
We label a small fraction of the execution traces with their verdict of pass or fail.
We use labelled traces to train a neural network (NN) model to learn to distinguish runtime patterns for passing versus failing executions.
arXiv Detail & Related papers (2020-01-08T10:53:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.