Related papers: Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog

Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog

URL: http://arxiv.org/abs/2409.15186v2
Date: Sun, 29 Sep 2024 07:35:24 GMT
Title: Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog
Authors: Bingkun Yao, Ning Wang, Jie Zhou, Xi Wang, Hong Gao, Zhe Jiang, Nan Guan,
Abstract summary: This paper presents Location-is-Key, an opensource LLM solution to locate functional errors in Verilog snippets. The pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassed GPT-4's 77.9% and comparable to Claude-3.5's 90.8%.
Score: 28.954480748183943
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bug localization in Verilog code is a crucial and time-consuming task during the verification of hardware design. Since introduction, Large Language Models (LLMs) have showed their strong programming capabilities. However, no work has yet considered using LLMs for bug localization in Verilog code. This paper presents Location-is-Key, an opensource LLM solution to locate functional errors in Verilog snippets. LiK achieves high localization accuracy, with a pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassing GPT-4's 77.9% and comparable to Claude-3.5's 90.8%. Additionally, the bug location obtained by LiK significantly improves GPT-3.5's bug repair efficiency (Functional pass@1 increased from 40.39% to 58.92%), highlighting the importance of bug localization in LLM-based Verilog debugging. Compared to existing methods, LiK only requires the design specification and the erroneous code snippet, without the need for testbenches, assertions, or any other EDA tools. This research demonstrates the feasibility of using LLMs for Verilog error localization, thus providing a new direction for automatic Verilog code debugging.

Related papers

Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative APR tool for Dafny, a verification-aware programming language.<n>We localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program.<n>We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
VeriDebug: A Unified LLM for Verilog Debugging via Contrastive Embedding and Guided Correction [36.69082579950107]
We present Veri Debug, an approach that integrates contrastive representation and guided correction capabilities. Our model achieves 64.7 accuracy in bug fixing (Acc1), a significant improvement from the existing open-source SOTAs 11.3. This performance not only outperforms open-source alternatives but also exceeds larger closed-source models like GPT-3.5-turbo (36.6), offering a more accurate alternative to conventional debug methods.
arXiv Detail & Related papers (2025-04-27T04:09:48Z)
Where's the Bug? Attention Probing for Scalable Fault Localization [18.699014321422023]
We present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels. BAP is significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
arXiv Detail & Related papers (2025-02-19T18:59:32Z)
Scalable, Validated Code Translation of Entire Projects using Large Language Models [13.059046327936393]
Large language models (LLMs) show promise in code translation due to their ability to generate idiomatic code. Existing works have shown a drop in translation success rates for code exceeding around 100 lines. We develop a modular approach to translation, where we partition the code into small code fragments which can be independently translated. We show that we can consistently generate reliable Rust for projects up to 6,600 lines of code and 369 functions, with an average of 73% of functions successfully validated for I/O equivalence.
arXiv Detail & Related papers (2024-12-11T02:31:46Z)
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
MEIC: Re-thinking RTL Debug Automation using LLMs [18.964523115622928]
This work introduces a novel framework, Make Each Iteration Count (MEIC) MEIC is suitable for identifying and correcting both syntax and function errors. To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors.
arXiv Detail & Related papers (2024-05-10T22:32:39Z)
Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z)
Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks. But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs) It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
Large Language Models for Test-Free Fault Localization [11.080712737595174]
We propose a language model based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune language models with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%.
arXiv Detail & Related papers (2023-10-03T01:26:39Z)
Large Language Models in Fault Localisation [32.87044163543427]
This paper investigates the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop.
arXiv Detail & Related papers (2023-08-29T13:07:27Z)
Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes. We empirically evaluate the performance of several state-of-the-art LLMs for the this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z)
Benchmarking Large Language Models for Automated Verilog RTL Code Generation [21.747037230069854]
We characterize the ability of large language models (LLMs) to generate useful Verilog. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code.
arXiv Detail & Related papers (2022-12-13T16:34:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.