Generation Probabilities Are Not Enough: Exploring the Effectiveness of
Uncertainty Highlighting in AI-Powered Code Completions
- URL: http://arxiv.org/abs/2302.07248v1
- Date: Tue, 14 Feb 2023 18:43:34 GMT
- Title: Generation Probabilities Are Not Enough: Exploring the Effectiveness of
Uncertainty Highlighting in AI-Powered Code Completions
- Authors: Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q. Vera Liao, and
Jennifer Wortman Vaughan
- Abstract summary: We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale generative models have enabled the development of AI-powered code
completion tools to assist programmers in writing code. However, much like
other AI-powered tools, AI-powered code completions are not always accurate,
potentially introducing bugs or even security vulnerabilities into code if not
properly detected and corrected by a human programmer. One technique that has
been proposed and implemented to help programmers identify potential errors is
to highlight uncertain tokens. However, there have been no empirical studies
exploring the effectiveness of this technique, nor investigating the different
and not-yet-agreed-upon notions of uncertainty in the context of generative
models. We explore the question of whether conveying information about
uncertainty enables programmers to more quickly and accurately produce code
when collaborating with an AI-powered code completion tool, and if so, what
measure of uncertainty best fits programmers' needs. Through a mixed-methods
study with 30 programmers, we compare three conditions: providing the AI
system's code completion alone, highlighting tokens with the lowest likelihood
of being generated by the underlying generative model, and highlighting tokens
with the highest predicted likelihood of being edited by a programmer. We find
that highlighting tokens with the highest predicted likelihood of being edited
leads to faster task completion and more targeted edits, and is subjectively
preferred by study participants. In contrast, highlighting tokens according to
their probability of being generated does not provide any benefit over the
baseline with no highlighting. We further explore the design space of how to
convey uncertainty in AI-powered code completion tools, and find that
programmers prefer highlights that are granular, informative, interpretable,
and not overwhelming.
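The first highlighting condition in the abstract (flagging tokens with the lowest generation likelihood) can be illustrated with a minimal sketch. This is not the paper's implementation; the `highlight_uncertain_tokens` helper, the `[[...]]` marker syntax, the threshold value, and the token probabilities are all illustrative assumptions. In a real tool the per-token probabilities would come from the underlying generative model.

```python
# Illustrative sketch (not the paper's implementation): wrap tokens
# whose generation probability falls below a threshold in [[...]]
# markers, mimicking the "lowest generation likelihood" condition.
# The threshold and all probability values below are made up.

def highlight_uncertain_tokens(token_probs, threshold=0.5):
    """Render a completion with low-probability tokens marked.

    token_probs: list of (token, probability) pairs as they would be
    reported by the model for each generated token.
    Returns the rendered string and the list of flagged tokens.
    """
    rendered, flagged = [], []
    for token, prob in token_probs:
        if prob < threshold:
            rendered.append(f"[[{token}]]")
            flagged.append(token)
        else:
            rendered.append(token)
    return "".join(rendered), flagged

# Hypothetical completion with per-token probabilities.
completion = [("def ", 0.98), ("parse", 0.35), ("(", 0.99),
              ("path", 0.62), (")", 0.99), (":", 0.97)]
text, flagged = highlight_uncertain_tokens(completion)
# text == "def [[parse]](path):", flagged == ["parse"]
```

The second condition studied in the paper (highlighting tokens most likely to be *edited* by a programmer) would use the same rendering step but a different scoring model, which is the key distinction the study draws.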
Related papers
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [59.32609948217718]
We present CodeIP, a new watermarking technique for LLM-based code generation.
CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- DeVAIC: A Tool for Security Assessment of AI-generated Code [5.383910843560784]
DeVAIC (Detection of Vulnerabilities in AI-generated Code) is a tool to evaluate the security of AI-generated Python code.
arXiv Detail & Related papers (2024-04-11T08:27:23Z)
- Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models [54.58108387797138]
We investigate the effectiveness of prompt learning in code intelligence tasks.
Existing automatic prompt design methods have limited applicability to code intelligence tasks.
We propose Genetic Auto Prompt (GenAP) which utilizes an elaborate genetic algorithm to automatically design prompts.
arXiv Detail & Related papers (2024-03-20T13:37:00Z)
- Students' Perspective on AI Code Completion: Benefits and Challenges [2.936007114555107]
We investigated the benefits, challenges, and expectations of AI code completion from students' perspectives.
Our findings show that AI code completion enhanced students' productivity and efficiency by providing correct syntax suggestions.
In the future, AI code completion should be explainable and provide best coding practices to enhance the education process.
arXiv Detail & Related papers (2023-10-31T22:41:16Z)
- PrAIoritize: Automated Early Prediction and Prioritization of Vulnerabilities in Smart Contracts [1.081463830315253]
Smart contracts are prone to numerous security threats due to undisclosed vulnerabilities and code weaknesses.
Efficient prioritization is crucial for smart contract security.
Our research aims to provide an automated approach, PrAIoritize, for prioritizing and predicting critical code weaknesses.
arXiv Detail & Related papers (2023-08-21T23:30:39Z)
- Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [85.03486419424647]
KnowNo is a framework for measuring and aligning the uncertainty of large language models.
KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion.
arXiv Detail & Related papers (2023-07-04T21:25:12Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study was to examine half of the most significant code advances in the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- Aligning Offline Metrics and Human Judgments of Value for Code Generation Models [25.726216146776054]
We show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task.
We propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value.
arXiv Detail & Related papers (2022-10-29T05:03:28Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.