Related papers: R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation

URL: http://arxiv.org/abs/2511.20090v2
Date: Wed, 26 Nov 2025 04:41:00 GMT
Title: R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation
Authors: Zizhang Luo, Fan Cui, Kexing Zhou, Runlin Guo, Mile Xia, Hongyuan Hou, Yun Liang
Abstract summary: We propose R3A, an automatic program repair framework built upon the basic model to improve reliability. Experiments show R3A can fix 90.6% of bugs in the RTL-repair dataset within a given time limit.
Score: 3.5576449247822506
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Repairing RTL bugs is crucial for hardware design and verification. Traditional automatic program repair (APR) methods define dedicated search spaces to locate and fix bugs with program synthesis. However, they heavily rely on fixed templates and can only deal with limited bugs. As an alternative, Large Language Models with the ability to understand code semantics can be explored for RTL repair. However, they suffer from unreliable outcomes due to inherent randomness and long input contexts of RTL code and waveform. To address these challenges, we propose R3A, an LLM-based automatic RTL program repair framework upon the basic model to improve reliability. R3A proposes the stochastic Tree-Of-Thoughts method to control a patch generation agent to explore a validated solution for the bug. The algorithm samples search states according to a heuristic function to balance between exploration and exploitation for a reliable outcome. Besides, R3A proposes a multi-agent fault localization method to find fault candidates as the starting points for the patch generation agent, further increasing the reliability. Experiments show R3A can fix 90.6% of bugs in the RTL-repair dataset within a given time limit, which covers 45% more bugs than traditional methods and other LLM-based approaches, while achieving an 86.7% pass@5 rate on average, showing a high reliability.
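The abstract describes sampling search states according to a heuristic function to balance exploration and exploitation. The following is a minimal sketch of that general idea, not the authors' implementation. The softmax weighting, the `temperature` parameter, the `State` class, and the `expand` / `is_valid` callbacks are all assumptions introduced for illustration; in an RTL setting, `expand` would stand in for an LLM patch generation step and `is_valid` for a simulation or testbench check. The toy problem at the bottom (building the string "abc") only exercises the search loop.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class State:
    """One node of the search tree: a candidate patch and its heuristic score."""
    patch: str
    score: float  # hypothetical heuristic; higher means more promising


def softmax_weights(scores: List[float], temperature: float) -> List[float]:
    """Convert scores to sampling probabilities (assumed weighting scheme)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_tree_search(
    roots: List[State],
    expand: Callable[[State], List[State]],
    is_valid: Callable[[State], bool],
    budget: int = 50,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[State]:
    """Sample a state to expand at each step instead of always taking the best.

    Low temperature favours high-scoring states (exploitation); high
    temperature spreads the samples more evenly (exploration). Expanded
    states stay in the pool, so they can be revisited later.
    """
    rng = rng or random.Random()
    pool = list(roots)
    for _ in range(budget):
        if not pool:
            return None
        weights = softmax_weights([s.score for s in pool], temperature)
        state = rng.choices(pool, weights=weights, k=1)[0]
        for child in expand(state):
            if is_valid(child):  # stop at the first validated candidate
                return child
            pool.append(child)
    return None


# Toy stand-ins (hypothetical): grow a string until it equals TARGET.
TARGET = "abc"


def toy_score(patch: str) -> float:
    """Length of the common prefix with TARGET."""
    n = 0
    for got, want in zip(patch, TARGET):
        if got != want:
            break
        n += 1
    return float(n)


def toy_expand(state: State) -> List[State]:
    """Append one letter; stop growing once the string is as long as TARGET."""
    if len(state.patch) >= len(TARGET):
        return []
    return [State(state.patch + c, toy_score(state.patch + c)) for c in "abc"]


def toy_is_valid(state: State) -> bool:
    return state.patch == TARGET


if __name__ == "__main__":
    found = stochastic_tree_search(
        [State("", 0.0)], toy_expand, toy_is_valid,
        budget=200, temperature=0.25, rng=random.Random(0),
    )
    print(found)
```

In this sketch the root states play the role of fault candidates: each one is a starting point from which candidate patches are grown. Keeping expanded states in the pool reflects the assumption that a stochastic generator may return a different candidate when asked again.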

Related papers

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z)
Outcome-Conditioned Reasoning Distillation for Resolving Software Issues [49.16055123488827]
We present an Outcome-Conditioned Reasoning Distillation (O-CRD) framework that uses resolved in-repository issues with verified patches as supervision. Starting from a historical fix, the method reconstructs a stage-wise repair trace backward from the verified outcome. On SWE-Bench Lite, this approach increases Pass@1 by 10.4% with GPT-4o, 8.6% with DeepSeek-V3, and 10.3% with GPT-5.
arXiv Detail & Related papers (2026-01-30T18:25:39Z)
CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning [4.765206163164323]
CLEANER exploits intrinsic self-correction capabilities to eliminate error-contaminated context during data collection. A Similarity-Aware Adaptive Rollback mechanism autonomously constructs clean, purified trajectories. Results show average accuracy gains of 6%, 3%, and 5% over baselines.
arXiv Detail & Related papers (2026-01-21T16:14:30Z)
BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
Automated Repair of C Programs Using Large Language Models [0.0]
This study explores the potential of Large Language Models (LLMs) in automating the repair of C programs. We present a framework that integrates spectrum-based fault localization (SBFL), runtime feedback, and Chain-of-Thought-structured prompting into an autonomous repair loop. Our approach achieves 44.93% repair accuracy, representing a 3.61% absolute improvement over strong state-of-the-art APR baselines.
arXiv Detail & Related papers (2025-09-02T04:34:11Z)
Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs [5.10123605644148]
Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques.
arXiv Detail & Related papers (2025-07-28T16:39:16Z)
Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [79.74676890436174]
We present an APR tool for Dafny that uses formal specifications as oracles for fault localization and repair. We localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program. Our tool achieves 89.6% fault localization coverage and GPT-4o mini yields the highest repair success rate of 74.18%.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
EDA-Aware RTL Generation with Large Language Models [0.7831852829409273]
Large Language Models (LLMs) have become increasingly popular for generating RTL code. Producing error-free RTL code in a zero-shot setting remains highly challenging for even state-of-the-art LLMs. We introduce AIvril2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors.
arXiv Detail & Related papers (2024-11-21T00:37:51Z)
MEIC: Re-thinking RTL Debug Automation using LLMs [18.964523115622928]
This work introduces a novel framework, Make Each Iteration Count (MEIC). MEIC is suitable for identifying and correcting both syntax and function errors. To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors.
arXiv Detail & Related papers (2024-05-10T22:32:39Z)
SBEST: Spectrum-Based Fault Localization Without Fault-Triggering Tests [17.90798133817018]
This study investigates the feasibility of using stack traces from crash reports as proxies for fault-triggering tests in Spectrum-Based Fault Localization. We propose SBEST, a novel approach that integrates stack trace information with test coverage data to perform fault localization when fault-triggering tests are missing.
arXiv Detail & Related papers (2024-05-01T15:15:52Z)
A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debugging through multi-agent synergy. It significantly outperforms state-of-the-art repair methods, fixing 1.25$\times$ to 2.56$\times$ as many bugs on the repo-level benchmark, Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z)
Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation [44.3761164214368]
We investigate Round-Trip Translation (RTT): translating code from one programming language into another programming or natural language and back. We perform a detailed quantitative and qualitative analysis of RTT-generated patches in Java. We find that RTT through English generates plausible patches for 100 of 164 bugs with GPT-4 on the HumanEval-Java benchmark, and 97 are found to be correct in our manual assessment.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs. We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and the Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.