WatChat: Explaining perplexing programs by debugging mental models
- URL: http://arxiv.org/abs/2403.05334v2
- Date: Wed, 02 Oct 2024 17:05:24 GMT
- Title: WatChat: Explaining perplexing programs by debugging mental models
- Authors: Kartik Chandra, Katherine M. Collins, Will Crichton, Tony Chen, Tzu-Mao Li, Adrian Weller, Rachit Nigam, Joshua Tenenbaum, Jonathan Ragan-Kelley
- Abstract summary: We build systems for explanation in two domains: JavaScript type coercion, and the Git version control system.
We show that WatChat's explanations exhibit key features of human-written explanation, unlike those of a state-of-the-art language model.
- Score: 33.238462470842386
- License:
- Abstract: Often, a good explanation for a program's unexpected behavior is a bug in the programmer's code. But sometimes, an even better explanation is a bug in the programmer's mental model of the language or API they are using. Instead of merely debugging our current code ("giving the programmer a fish"), what if our tools could directly debug our mental models ("teaching the programmer to fish")? In this paper, we apply recent ideas from computational cognitive science to offer a principled framework for doing exactly that. Given a "why?" question about a program, we automatically infer potential misconceptions about the language/API that might cause the user to be surprised by the program's behavior -- and then analyze those misconceptions to provide explanations of the program's behavior. Our key idea is to formally represent misconceptions as counterfactual (erroneous) semantics for the language/API, which can be inferred and debugged using program synthesis techniques. We demonstrate our framework, WatChat, by building systems for explanation in two domains: JavaScript type coercion, and the Git version control system. We evaluate WatChatJS and WatChatGit by comparing their outputs to experimentally-collected human-written explanations in these two domains: we show that WatChat's explanations exhibit key features of human-written explanation, unlike those of a state-of-the-art language model.
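To make the key idea concrete, here is a minimal toy sketch in Python (not WatChat's implementation or its actual misconception representation): candidate erroneous semantics for JavaScript's `+` operator are written as tiny interpreters, and a misconception whose prediction diverges from the real behavior is offered as an explanation for the user's surprise. The rule names and the example input are illustrative assumptions.

```python
# Toy sketch of the WatChat idea (not the authors' code): represent candidate
# *erroneous* semantics for JavaScript's "+" and search for a misconception
# under which the user would have predicted a different result.

def js_plus_correct(a, b):
    # Real JS: if either operand is a string, "+" concatenates; otherwise it adds.
    if isinstance(a, str) or isinstance(b, str):
        return str(a) + str(b)
    return a + b

def js_plus_always_numeric(a, b):
    # Misconception: "+" always coerces both operands to numbers.
    return float(a) + float(b)

def js_plus_left_biased(a, b):
    # Misconception: the left operand's type decides between add and concat.
    if isinstance(a, str):
        return str(a) + str(b)
    return a + float(b)

MISCONCEPTIONS = {
    "'+' always coerces to numbers": js_plus_always_numeric,
    "the left operand's type decides '+'": js_plus_left_biased,
}

def explain_surprise(a, b):
    actual = js_plus_correct(a, b)
    for description, semantics in MISCONCEPTIONS.items():
        try:
            predicted = semantics(a, b)
        except (TypeError, ValueError):
            continue  # this misconception makes no prediction on these inputs
        if predicted != actual:
            # A user holding this misconception would expect `predicted`,
            # so it is a candidate explanation for their surprise at `actual`.
            return (f"You may believe that {description}: you expected "
                    f"{predicted!r}, but JavaScript produces {actual!r}.")
    return f"The result {actual!r} matches all modeled mental models."

if __name__ == "__main__":
    print(explain_surprise(1, "2"))  # actual '12'; numeric misconception predicts 3.0
```

WatChat itself infers and debugs such counterfactual semantics with program synthesis rather than enumerating a hand-written list; the sketch only shows the shape of the inference.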
Related papers
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
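As a rough illustration of what "inspecting an execution trace" could look like in practice (a minimal sketch under our own assumptions, not NExT's pipeline), the snippet below records a line-level trace of a small buggy function and formats it as text that a code model could condition on; the prompt format and the `buggy_max` example are invented for illustration.

```python
# Minimal sketch: capture a line-level execution trace and turn it into
# prompt text for a code LLM (illustrative only, not NExT's method).
import sys

def trace_lines(func, *args):
    """Run `func(*args)` and record (line number, local variables) per step."""
    steps = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None              # ignore frames other than `func`
        if event == "line":
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, steps

def buggy_max(xs):
    best = 0                         # bug: fails for all-negative inputs
    for x in xs:
        if x > best:
            best = x
    return best

if __name__ == "__main__":
    result, steps = trace_lines(buggy_max, [-3, -1, -7])
    trace_text = "\n".join(f"line {ln}: locals={lv}" for ln, lv in steps)
    prompt = (f"The function returned {result!r}.\n"
              f"Execution trace:\n{trace_text}\n"
              f"Why is the result wrong?")
    print(prompt)                    # this text would be fed to the model
```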
arXiv Detail & Related papers (2024-04-23T01:46:32Z) - What is a "bug"? On subjectivity, epistemic power, and implications for software research [8.116831482130555]
"Bug" has been a colloquialism for an engineering "defect" at least since the 1870s.
Most modern software-oriented definitions speak to a disconnect between what a developer intended and what a program actually does.
"Finding bugs is easy" begins by saying "bug patterns are code that are often errors"
arXiv Detail & Related papers (2024-02-13T01:52:42Z) - GuardRails: Automated Suggestions for Clarifying Ambiguous Purpose Statements [0.0]
Before writing a function, programmers are encouraged to write a purpose statement, i.e., a short, natural-language explanation of what the function computes.
A purpose statement may be ambiguous, i.e., it may fail to specify the intended behaviour when two or more inequivalent computations are plausible on certain inputs.
We propose a novel technique that suggests such inputs using Large Language Models (LLMs).
We create an open-source implementation of our technique as an extension to Visual Studio Code for the Python programming language.
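To make the notion of ambiguity concrete, here is a minimal sketch (assuming two hand-written candidate implementations rather than LLM-generated ones, unlike the paper's tool): a purpose statement counts as ambiguous if two plausible readings disagree on some input, and even a simple random search can surface such an input.

```python
# Sketch: a purpose statement is ambiguous if two plausible implementations
# disagree on some input. The candidate functions and the random search below
# are illustrative assumptions, not the paper's technique.
import random

PURPOSE = "median(xs): return the median of a non-empty list of numbers"

def median_average(xs):
    # Plausible reading 1: for even-length lists, average the two middle values.
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_lower(xs):
    # Plausible reading 2: for even-length lists, return the lower middle value.
    s = sorted(xs)
    return s[(len(s) - 1) // 2]

def find_disambiguating_input(f, g, trials=1000):
    """Search for an input on which the two readings differ."""
    for _ in range(trials):
        xs = [random.randint(0, 9) for _ in range(random.randint(1, 6))]
        if f(xs) != g(xs):
            return xs
    return None

if __name__ == "__main__":
    xs = find_disambiguating_input(median_average, median_lower)
    if xs is not None:
        print(f"{PURPOSE!r} is ambiguous on {xs}: "
              f"{median_average(xs)} vs {median_lower(xs)}")
```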
arXiv Detail & Related papers (2023-12-13T14:56:42Z) - Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning [84.12154024070024]
We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks.
Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge.
A Python interpreter then executes the generated code and prints the output.
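For illustration only, a program in the spirit of an NLEP might look like the following (this example is invented for this summary, not taken from the paper): structured knowledge lives in a Python data structure, a function reasons over it, and the interpreter prints the answer.

```python
# Illustrative NLEP-style program (invented example): knowledge as data,
# a function that reasons over it, and a printed final answer.

# Step 1: structured knowledge expressed as a Python data structure.
capitals = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Brazil": "Brasília",
}

# Step 2: a function that performs the required reasoning over the data.
def answer(country):
    capital = capitals.get(country)
    if capital is None:
        return f"I don't know the capital of {country}."
    return f"The capital of {country} is {capital}."

# Step 3: execute and print, so the interpreter's output is the answer.
print(answer("Japan"))
```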
arXiv Detail & Related papers (2023-09-19T17:54:21Z) - On Feasibility of Declarative Diagnosis [0.0]
We argue that useful approaches to the declarative diagnosis of logic programs exist and should be usable in actual programming.
This paper discusses what are possibly their main weaknesses and shows how to overcome them.
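As a loose illustration of the underlying idea (algorithmic debugging against an intended meaning), the sketch below transplants declarative diagnosis from logic programs to a small recursive Python function: ask an oracle whether each computed result is intended, and blame a call whose result is wrong while all of its sub-calls are right. The buggy function and the oracle are assumptions made for this example.

```python
# Sketch of declarative (algorithmic) debugging, illustrated on a recursive
# Python function instead of a logic program.

def buggy_sum(xs):
    """Intended meaning: sum of the list. Bug: drops the last element."""
    if len(xs) <= 1:
        return 0                     # wrong base case
    return xs[0] + buggy_sum(xs[1:])

def oracle(xs, result):
    """The intended (declarative) meaning, used to judge each call."""
    return result == sum(xs)

def diagnose(xs):
    """Return the smallest erroneous call found by descending the call tree."""
    result = buggy_sum(xs)
    if oracle(xs, result):
        return None                  # this call is correct
    if len(xs) > 1:
        child = diagnose(xs[1:])     # check the single sub-call first
        if child is not None:
            return child
    return (xs, result)              # wrong here, sub-calls correct: bug localized

if __name__ == "__main__":
    print(diagnose([1, 2, 3]))       # blames the base-case call: ([3], 0)
```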
arXiv Detail & Related papers (2023-08-30T08:56:19Z) - Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? [75.79305790453654]
Coaxing desired behaviors out of pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
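A rough sketch of AST-based static checking (our own simplification, not the paper's framework) might look as follows: parse the completion, report syntax errors, and approximately flag names that are read but never bound. The simple name analysis ignores scoping subtleties and is only meant to illustrate the idea.

```python
# Rough sketch: AST-based static checks on a generated Python completion.
import ast
import builtins

def static_errors(code):
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e.msg} (line {e.lineno})"]

    bound = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound.add(node.name)                  # defined functions/classes
        elif isinstance(node, ast.arg):
            bound.add(node.arg)                   # function parameters
        elif isinstance(node, ast.alias):
            bound.add(node.asname or node.name.split(".")[0])  # imports
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)                # assigned names
            else:
                used.add(node.id)                 # names that are read
    return [f"possibly undefined name: {n}" for n in sorted(used - bound)]

if __name__ == "__main__":
    completion = "def mean(xs):\n    return sum(xs) / lenght(xs)\n"  # typo: lenght
    print(static_errors(completion))  # -> ['possibly undefined name: lenght']
```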
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - Understanding Programs by Exploiting (Fuzzing) Test Cases [26.8259045248779]
We propose to incorporate the relationship between inputs and possible outputs/behaviors into learning, in order to achieve a deeper semantic understanding of programs.
To obtain inputs that are representative enough to trigger the execution of most parts of the code, we resort to fuzz testing and propose fuzz tuning.
The effectiveness of the proposed method is verified on two program understanding tasks, code clone detection and code classification, where it outperforms the current state of the art by large margins.
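As a simplified illustration (the paper relies on coverage-guided fuzzing and "fuzz tuning"; the naive random fuzzer and the text format below are assumptions), one can pair a function's source code with observed input/output behavior to form a richer input for a downstream model.

```python
# Naive sketch: pair code with observed input/output behavior (illustrative
# only; not the paper's coverage-guided fuzzing pipeline).
import inspect
import random

def fuzz_examples(func, n=5, seed=0):
    """Sample random list inputs and record what `func` does with them."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 4))]
        try:
            outcome = repr(func(xs))
        except Exception as e:           # crashes are informative behavior too
            outcome = f"raises {type(e).__name__}"
        examples.append(f"{func.__name__}({xs!r}) -> {outcome}")
    return examples

def first_positive(xs):
    return next(x for x in xs if x > 0)  # raises StopIteration if none

if __name__ == "__main__":
    augmented = inspect.getsource(first_positive) + "\n# Observed behavior:\n"
    augmented += "\n".join("# " + ex for ex in fuzz_examples(first_positive))
    print(augmented)                     # code + behavior, as one model input
```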
arXiv Detail & Related papers (2023-05-23T01:51:46Z) - Using Large Language Models to Enhance Programming Error Messages [5.903720638984496]
Large language models can be used to create useful enhancements to programming error messages.
We discuss the benefits and downsides of large language models and highlight future streams of research for enhancing programming error messages.
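A minimal sketch of this setup, with a hypothetical `ask_llm` placeholder standing in for whatever model client is available (no real API is assumed), could look like this: catch the raw error, combine it with the learner's code, and ask the model for a friendlier explanation.

```python
# Sketch: pair learner code with the raw error message and hand both to an
# LLM for an enhanced explanation. `ask_llm` is a hypothetical placeholder.
import traceback

LEARNER_CODE = 'ages = {"ada": 36}\nprint(ages["Ada"])\n'

def ask_llm(prompt):
    # Placeholder: in a real tool, call your LLM client of choice here.
    return "(LLM response would appear here)"

def enhanced_error_message(code):
    try:
        exec(compile(code, "<learner>", "exec"), {})
        return None                      # no error to enhance
    except Exception:
        raw = traceback.format_exc(limit=1)
        prompt = (
            "A novice programmer wrote this Python code:\n"
            f"{code}\n"
            "It produced this error:\n"
            f"{raw}\n"
            "Explain the error in plain language and suggest a fix."
        )
        return ask_llm(prompt)

if __name__ == "__main__":
    print(enhanced_error_message(LEARNER_CODE))
```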
arXiv Detail & Related papers (2022-10-20T23:17:26Z) - Diagnosing AI Explanation Methods with Folk Concepts of Behavior [70.10183435379162]
We consider "success" to depend not only on what information the explanation contains, but also on what information the human explainee understands from it.
We use folk concepts of behavior as a framework of social attribution by the human explainee.
arXiv Detail & Related papers (2022-01-27T00:19:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.