Related papers: REDO: Execution-Free Runtime Error Detection for COding Agents

REDO: Execution-Free Runtime Error Detection for COding Agents

URL: http://arxiv.org/abs/2410.09117v1
Date: Thu, 10 Oct 2024 18:06:29 GMT
Title: REDO: Execution-Free Runtime Error Detection for COding Agents
Authors: Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler,
Abstract summary: Execution-free Error Detection for COding Agents (REDO) is a method that integrates runtime errors with static analysis tools. We demonstrate that REDO outperforms current state-of-the-art methods by achieving a 11.0% higher accuracy and a 9.1% higher weighted F1 score.
Score: 3.9903610503301072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods by achieving a 11.0% higher accuracy and 9.1% higher weighted F1 score; and provide insights into the advantages of incorporating LLMs for error detection.

Related papers

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories [8.583591493627276]
We introduce JitVul, a vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. We show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code.
arXiv Detail & Related papers (2025-03-05T15:22:24Z)
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points [51.40935517552926]
We introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards critical error-prone areas. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
arXiv Detail & Related papers (2025-02-17T06:16:02Z)
Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents [31.126001253902416]
We present the first study focused on identifying and detecting defects in LLM Agents. We collected and analyzed 6,854 relevant posts from StackOverflow to define 8 types of agent defects. Our results show that Agentable achieved an overall accuracy of 88.79% and a recall rate of 91.03%.
arXiv Detail & Related papers (2024-12-24T11:54:14Z)
EDA-Aware RTL Generation with Large Language Models [0.7831852829409273]
Large Language Models (LLMs) have become increasingly popular for generating RTL code. producing error-free RTL code in a zero-shot setting remains highly challenging for even state-of-the-art LLMs. We introduce AIvril2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors.
arXiv Detail & Related papers (2024-11-21T00:37:51Z)
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
An Empirical Study on LLM-based Agents for Automated Bug Fixing [2.433168823911037]
Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically. We examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing.
arXiv Detail & Related papers (2024-11-15T14:19:15Z)
ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation [31.363781211927947]
Large language models (LLMs) have achieved impressive performance in code generation. LLMs are susceptible to error accumulation during code generation. We propose ROCODE, which integrates the backtracking mechanism and program analysis into LLMs for code generation.
arXiv Detail & Related papers (2024-11-11T16:39:13Z)
Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents [7.392058124132526]
Foundations models (FMs) play an increasingly prominent role in complex software systems, such as agentic software. Fast-thinking Large Language Models (LLMs) are still preferred due to latency constraints. We introduce Watson, a framework that provides reasoning observability into implicit reasoning processes.
arXiv Detail & Related papers (2024-11-05T19:13:22Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing [6.334110674473677]
Existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code. We propose AutoSafeCoder, a multi-agent framework that leverages LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. Our contribution focuses on ensuring the safety of multi-agent code generation by integrating dynamic and static testing in an iterative process during code generation.
arXiv Detail & Related papers (2024-09-16T21:15:56Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
Agent-Driven Automatic Software Improvement [55.2480439325792]
This research proposal aims to explore innovative solutions by focusing on the deployment of agents powered by Large Language Models (LLMs) The iterative nature of agents, which allows for continuous learning and adaptation, can help surpass common challenges in code generation. We aim to use the iterative feedback in these systems to further fine-tune the LLMs underlying the agents, becoming better aligned to the task of automated software improvement.
arXiv Detail & Related papers (2024-06-24T15:45:22Z)
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
arXiv Detail & Related papers (2024-05-22T19:02:50Z)
A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debug through multi-agent synergy. It significantly outperforms state-of-the-art repair methods, fixing 1.25$times$ to 2.56$times$ bugs on the repo-level benchmark, Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
RL-GPT: Integrating Reinforcement Learning and Code-as-policy [82.1804241891039]
We introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline.
arXiv Detail & Related papers (2024-02-29T16:07:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.