Related papers: LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering

LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering

URL: http://arxiv.org/abs/2411.09916v1
Date: Fri, 15 Nov 2024 03:29:41 GMT
Title: LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering
Authors: Jiessie Tie, Bingsheng Yao, Tianshi Li, Syed Ishtiaque Ahmed, Dakuo Wang, Shurui Zhou,
Abstract summary: We conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial software engineering task. We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users.
Score: 38.20696656193963
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Software engineers are integrating AI assistants into their workflows to enhance productivity and reduce cognitive strain. However, experiences vary significantly, with some engineers finding large language models (LLMs), like ChatGPT, beneficial, while others consider them counterproductive. Researchers also found that ChatGPT's answers included incorrect information. Given the fact that LLMs are still imperfect, it is important to understand how to best incorporate LLMs into the workflow for software engineering (SE) task completion. Therefore, we conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial SE task to understand the practices, challenges, and opportunities for using LLMs for SE tasks. We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users. These findings contribute to the overall understanding and strategies for human-AI interaction on SE tasks. Our study also highlights future research and tooling support directions.

Related papers

Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs.<n>This paper investigates how programmers use Large Language Models to complement their own skills.<n>The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny [79.56218230251953]
Students in computing education increasingly use large language models (LLMs) such as ChatGPT.<n>This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny.
arXiv Detail & Related papers (2025-06-27T16:34:13Z)
Junior Software Developers' Perspectives on Adopting LLMs for Software Engineering: a Systematic Literature Review [17.22501688824729]
This paper provides an overview of junior software developers' perspectives and use of Large Language Model-based tools for software engineering (LLM4SE) We conducted a systematic literature review following guidelines by Kitchenham et al. on 56 primary studies. Only 8.9% of the studies provide a clear definition for junior software developers, and there is no uniformity.
arXiv Detail & Related papers (2025-03-10T17:25:24Z)
Analysis of Student-LLM Interaction in a Software Engineering Project [1.2233362977312945]
We analyze 126 undergraduate students' interaction with an AI assistant during a 13-week semester to understand the benefits of AI for software engineering learning. Our findings suggest that students prefer ChatGPT over CoPilot. conversational-based interaction helps improve the quality of the code generated compared to auto-generated code.
arXiv Detail & Related papers (2025-02-03T11:44:00Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
An Exploratory Study on Upper-Level Computing Students' Use of Large Language Models as Tools in a Semester-Long Project [2.7325338323814328]
The purpose of this study is to explore computing students' experiences and approaches to using LLMs during a semester-long software engineering project. We collected data from a senior-level software engineering course at Purdue University. We analyzed the data to identify themes related to students' usage patterns and learning outcomes.
arXiv Detail & Related papers (2024-03-27T15:21:58Z)
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model. We employ a proxy model which has far fewer parameters, and take its answers as answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
Why and When LLM-Based Assistants Can Go Wrong: Investigating the Effectiveness of Prompt-Based Interactions for Software Help-Seeking [5.755004576310333]
Large Language Model (LLM) assistants have emerged as potential alternatives to search methods for helping users navigate software. LLM assistants use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions.
arXiv Detail & Related papers (2024-02-12T19:49:58Z)
Rocks Coding, Not Development--A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks [9.455579863269714]
We examined whether and to what degree working with ChatGPT was helpful in the coding task and typical software development task. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was not that good. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers.
arXiv Detail & Related papers (2024-02-08T13:07:31Z)
Efficient Tool Use with Chain-of-Abstraction Reasoning [65.18096363216574]
Large language models (LLMs) need to ground their reasoning to real-world knowledge. There remains challenges for fine-tuning LLM agents to invoke tools in multi-step reasoning problems. We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)
An Empirical Study on Usage and Perceptions of LLMs in a Software Engineering Project [1.433758865948252]
Large Language Models (LLMs) represent a leap in artificial intelligence, excelling in tasks using human language(s) In this paper, we analyze the AI-generated code, prompts used for code generation, and the human intervention levels to integrate the code into the code base. Our findings suggest that LLMs can play a crucial role in the early stages of software development.
arXiv Detail & Related papers (2024-01-29T14:32:32Z)
Lessons from Building StackSpot AI: A Contextualized AI Coding Assistant [2.268415020650315]
A new breed of tools, built atop Large Language Models, is emerging. These tools aim to mitigate drawbacks by employing techniques like fine-tuning or enriching user prompts with contextualized information.
arXiv Detail & Related papers (2023-11-30T10:51:26Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
How Can Recommender Systems Benefit from Large Language Models: A Survey [82.06729592294322]
Large language models (LLM) have shown impressive general intelligence and human-like capabilities. We conduct a comprehensive survey on this research direction from the perspective of the whole pipeline in real-world recommender systems.
arXiv Detail & Related papers (2023-06-09T11:31:50Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.