Right or Wrong -- Understanding How Novice Users Write Software Models
- URL: http://arxiv.org/abs/2402.06624v3
- Date: Sat, 30 Mar 2024 16:45:40 GMT
- Title: Right or Wrong -- Understanding How Novice Users Write Software Models
- Authors: Ana Jovanovic, Allison Sullivan
- Abstract summary: This paper presents an empirical study of over 97,000 models written by novice users trying to learn Alloy.
We investigate how users write both correct and incorrect models in order to produce a comprehensive benchmark for future use.
- Score: 0.6445605125467574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Writing declarative models has numerous benefits, ranging from automated reasoning and correction of design-level properties before systems are built, to automated testing and debugging of their implementations after they are built. Alloy is a declarative modeling language that is well-suited for verifying system designs. A key strength of Alloy is its scenario-finding toolset, the Analyzer, which allows users to explore all valid scenarios that adhere to the model's constraints up to a user-provided scope. However, even with visualized scenarios, it is difficult to write correct Alloy models. To address this, a growing body of work explores different techniques for debugging Alloy models. In order to develop and evaluate these techniques in an effective manner, this paper presents an empirical study of over 97,000 models written by novice users trying to learn Alloy. We investigate how users write both correct and incorrect models in order to produce a comprehensive benchmark for future use as well as a series of observations to guide debugging and educational efforts for Alloy model development.
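To make the abstract concrete for readers new to Alloy, here is a minimal illustrative model (a sketch for this summary, not drawn from the paper's 97,000-model corpus): a signature declares a set of atoms, a fact constrains all valid scenarios, `run` asks the Analyzer to enumerate scenarios up to a scope, and `check` searches that scope for counterexamples to an assertion.

```alloy
-- Minimal illustrative model: nodes forming an acyclic linked structure.
sig Node {
  next: lone Node  -- each node has at most one successor
}

-- Design-level constraint: no node can reach itself via next.
fact Acyclic {
  no n: Node | n in n.^next
}

-- Ask the Analyzer to enumerate valid scenarios with up to 4 atoms.
run {} for 4

-- Claim: acyclicity rules out self-loops; check it up to scope 4.
assert NoSelfLoop {
  no n: Node | n.next = n
}
check NoSelfLoop for 4
```

If a scenario violating `NoSelfLoop` exists within the scope, the Analyzer renders it as a visualized counterexample; otherwise the assertion holds up to the given scope.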
Related papers
- Exploring Efficient Foundational Multi-modal Models for Video Summarization [15.418001616659808]
Video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space.
We propose a plug-and-play video language model that feeds texts generated from each input modality into the language model.
We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods.
arXiv Detail & Related papers (2024-10-09T20:07:06Z)
- Structure Editor for Building Software Models [0.5735035463793009]
A recent study of over 93,000 new user models reveals that users have trouble from the very start: nearly a third of the models novices write fail to compile.
We believe the issue is that Alloy's grammar and type information is only passively relayed to the user, even though this information outlines a narrow path for composing valid formulas.
In this paper, we outline a proof-of-concept structure editor for Alloy in which users build their models from block-based inputs rather than free typing, which by design prevents compilation errors; a hypothetical example of such an error appears after this entry.
arXiv Detail & Related papers (2024-06-13T18:21:02Z)
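The sketch below is a hypothetical illustration (not taken from the study's data) of the kind of compilation error a block-based structure editor would rule out by construction: a type mismatch that Alloy's Analyzer rejects before any analysis runs.

```alloy
sig File { size: one Int }

-- Novice attempt: equates the set of File atoms with an integer.
-- The Analyzer rejects this as a type error at compile time:
-- fact Wrong { File = 3 }

-- Intended constraint: there are exactly three File atoms.
fact Right { #File = 3 }
```

The idea is that in a structure editor the ill-typed formula cannot be assembled in the first place, since only blocks whose types fit are offered at each hole.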
- Deciphering AutoML Ensembles: cattleia's Assistance in Decision-Making [0.0]
Cattleia is an application that deciphers the ensembles for regression, multiclass, and binary classification tasks.
It works with models built by three AutoML packages: auto-sklearn, AutoGluon, and FLAML.
arXiv Detail & Related papers (2024-03-19T11:56:21Z)
- Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation.
The common practice of using sampling during generation also increases chances of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under a further elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate over entire cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- What is the best recipe for character-level encoder-only modelling? [2.792030485253753]
This paper aims to benchmark recent progress in language understanding models that output contextualised representations at the character level.
We find that our best performing character-level model exceeds the performance of a token-based model trained with the same settings on the same data.
We believe our results demonstrate the readiness of character-level models for multilingual language representation, and encourage NLP practitioners to try them as drop-in replacements for token-based models.
arXiv Detail & Related papers (2023-05-09T14:00:15Z)
- Assessing Out-of-Domain Language Model Performance from Few Examples [38.245449474937914]
We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion.
We benchmark the performance on this task when looking at model accuracy on the few-shot examples.
We show that attribution-based factors can help rank relative model OOD performance.
arXiv Detail & Related papers (2022-10-13T04:45:26Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.