STL: Surprisingly Tricky Logic (for System Validation)
- URL: http://arxiv.org/abs/2305.17258v1
- Date: Fri, 26 May 2023 21:01:26 GMT
- Title: STL: Surprisingly Tricky Logic (for System Validation)
- Authors: Ho Chit Siu, Kevin Leahy, and Makai Mann
- Abstract summary: Ground-truth validity of a specification, subjects' familiarity with formal methods, and subjects' level of education were found to be significant factors in determining validation correctness.
Participants exhibited an affirmation bias, causing significantly increased accuracy on valid specifications, but significantly decreased accuracy on invalid specifications.
- Score: 0.04301276597844757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much of the recent work developing formal methods techniques to specify or
learn the behavior of autonomous systems is predicated on a belief that formal
specifications are interpretable and useful for humans when checking systems.
Though frequently asserted, this assumption is rarely tested. We performed a
human experiment (N = 62) with a mix of people who were and were not familiar
with formal methods beforehand, asking them to validate whether a set of signal
temporal logic (STL) constraints would keep an agent out of harm and allow it
to complete a task in a gridworld capture-the-flag setting. Validation accuracy
was $45\% \pm 20\%$ (mean $\pm$ standard deviation). The ground-truth validity
of a specification, subjects' familiarity with formal methods, and subjects'
level of education were found to be significant factors in determining
validation correctness. Participants exhibited an affirmation bias, causing
significantly increased accuracy on valid specifications, but significantly
decreased accuracy on invalid specifications. Additionally, participants,
particularly those familiar with formal methods, tended to be overconfident in
their answers, and be similarly confident regardless of actual correctness.
Our data do not support the belief that formal specifications are inherently
human-interpretable to a meaningful degree for system validation. We recommend
ergonomic improvements to data presentation and validation training, which
should be tested before claims of interpretability make their way back into the
formal methods literature.
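For readers unfamiliar with STL, the specifications in the study conjoin a safety requirement (keep the agent out of harm) with a task requirement (capture the flag). The sketch below is purely illustrative and not the paper's actual specifications or evaluation tooling: it shows how a discrete-time validity check of that shape could be run against a gridworld trajectory. The hazard cells, flag location, and trace are hypothetical.
```python
# Illustrative sketch only (not the paper's specifications): discrete-time
# evaluation of two STL-style requirements over a gridworld trajectory --
# "always stay out of hazard cells" and "eventually reach the flag cell".

HAZARDS = {(1, 2), (3, 3)}   # hypothetical "harm" cells
FLAG = (4, 4)                # hypothetical goal cell

def always_safe(trace):
    """G(not hazard): the agent never enters a hazard cell."""
    return all(pos not in HAZARDS for pos in trace)

def eventually_flag(trace):
    """F(flag): the agent reaches the flag at some time step."""
    return any(pos == FLAG for pos in trace)

def spec_holds(trace):
    """Conjunction of the safety and task-completion requirements."""
    return always_safe(trace) and eventually_flag(trace)

if __name__ == "__main__":
    trace = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 4)]  # hypothetical trajectory
    print(spec_holds(trace))  # True: avoids hazards and captures the flag
```
Validating a specification, as studied in the paper, is the harder inverse task: deciding from the formula alone whether it actually enforces the intended safety and task behavior.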
Related papers
- FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees [41.78390564658645]
The propensity of Large Language Models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains.
We introduce FactTest, a novel framework that statistically assesses whether a LLM can confidently provide correct answers to given questions.
We show that FactTest effectively detects hallucinations and improves the model's ability to abstain from answering unknown questions, leading to an over 40% accuracy improvement.
arXiv Detail & Related papers (2024-11-04T20:53:04Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Consistent Validation for Predictive Methods in Spatial Settings [17.44650272751289]
Spatial prediction tasks are key to weather forecasting, studying air pollution, and other scientific endeavors.
Classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions.
We formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense.
arXiv Detail & Related papers (2024-02-05T21:33:22Z) - Efficient Conformal Prediction under Data Heterogeneity [79.35418041861327]
Conformal Prediction (CP) stands out as a robust framework for uncertainty quantification.
Existing approaches for tackling non-exchangeability lead to methods that are not computable beyond the simplest examples.
This work introduces a new efficient approach to CP that produces provably valid confidence sets for fairly general non-exchangeable data distributions.
arXiv Detail & Related papers (2023-12-25T20:02:51Z) - Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z) - Testing for Overfitting [0.0]
We discuss the overfitting problem and explain why standard concentration results do not hold for evaluation with training data.
We introduce and argue for a hypothesis test by means of which model performance may be evaluated using training data.
arXiv Detail & Related papers (2023-05-09T22:49:55Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z) - Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z) - Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors [3.55517579369797]
We show that 5 to 10% of training data are enough for a BERT-based error detection method to achieve performance equivalent to a non-language model-based method.
We also show, using pseudo error data, that the method exhibits desirable properties when learning rules for recognizing various types of error.
arXiv Detail & Related papers (2021-08-27T10:37:14Z) - Information-Theoretic Probing with Minimum Description Length [74.29846942213445]
We propose an alternative to the standard probes: information-theoretic probing with minimum description length (MDL).
With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data.
We show that these methods agree in results and are more informative and stable than the standard probes.
arXiv Detail & Related papers (2020-03-27T09:35:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.