LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models
- URL: http://arxiv.org/abs/2504.01404v1
- Date: Wed, 02 Apr 2025 06:40:57 GMT
- Title: LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models
- Authors: Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, Lingfeng Bao,
- Abstract summary: The SZZ algorithm is the dominant technique for identifying bug-inducing commits. It serves as a foundation for many software engineering studies, such as bug prediction and static code analysis. Recently, a deep learning-based SZZ algorithm has been introduced to enhance the original SZZ algorithm.
- Score: 10.525352489242398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The SZZ algorithm is the dominant technique for identifying bug-inducing commits and serves as a foundation for many software engineering studies, such as bug prediction and static code analysis. Researchers have proposed many variants to enhance the SZZ algorithm's performance since its introduction. The majority of them rely on static techniques or heuristic assumptions, making them easy to implement, but their performance improvements are often limited. Recently, a deep learning-based SZZ algorithm has been introduced to enhance the original SZZ algorithm. However, it requires complex preprocessing and is restricted to a single programming language. Additionally, while it enhances precision, it sacrifices recall. Furthermore, most variants overlook crucial information, such as commit messages and patch context, and are limited to bug-fixing commits involving deleted lines. The emergence of large language models (LLMs) offers an opportunity to address these drawbacks. In this study, we investigate the strengths and limitations of LLMs and propose LLM4SZZ, which employs two approaches (i.e., rank-based identification and context-enhanced identification) to handle different types of bug-fixing commits. We determine which approach to adopt based on the LLM's ability to comprehend the bug and identify whether the bug is present in a commit. The context-enhanced identification provides the LLM with more context and requires it to find the bug-inducing commit among a set of candidate commits. In rank-based identification, we ask the LLM to select buggy statements from the bug-fixing commit and rank them based on their relevance to the root cause. Experimental results show that LLM4SZZ outperforms all baselines across three datasets, improving F1-score by 6.9% to 16.0% without significantly sacrificing recall.
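The abstract outlines a two-branch workflow. The sketch below is a minimal, hypothetical illustration of that control flow, assuming a generic `llm_chat` helper and simplified data structures; which branch fires in which case, and how a top-ranked statement is mapped back to a commit, are simplifications rather than the paper's actual implementation.

```python
# Illustrative sketch of the LLM4SZZ control flow described in the abstract.
# All helper names here (llm_chat, blame_statement, ...) are hypothetical.

from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    message: str
    diff: str

def llm_chat(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def comprehends_bug(fix: Commit) -> bool:
    """Ask the LLM whether it can grasp the bug from the fixing commit alone."""
    answer = llm_chat(
        f"Bug-fixing commit message:\n{fix.message}\n\nDiff:\n{fix.diff}\n"
        "Can you state the root cause of this bug? Answer YES or NO."
    )
    return answer.strip().upper().startswith("YES")

def context_enhanced_identification(fix: Commit, candidates: list[Commit]) -> Commit:
    """Give the LLM extra context and ask it to pick the inducing commit."""
    listing = "\n\n".join(f"[{i}] {c.sha}\n{c.diff}" for i, c in enumerate(candidates))
    answer = llm_chat(
        f"The following commit fixes a bug:\n{fix.diff}\n\n"
        f"Candidate commits:\n{listing}\n\n"
        "Which candidate most likely introduced the bug? Reply with its index."
    )
    return candidates[int(answer.strip())]

def rank_based_identification(fix: Commit, candidates: list[Commit]) -> Commit:
    """Ask the LLM to select and rank buggy statements, then map the top one back."""
    ranked = llm_chat(
        f"From this bug-fixing diff, list the statements most relevant to the "
        f"root cause, ordered by relevance:\n{fix.diff}"
    )
    top_statement = ranked.splitlines()[0]
    # Map the top-ranked statement back to a candidate commit (e.g., via git blame);
    # a naive substring match stands in for that step here.
    return next(c for c in candidates if top_statement in c.diff)

def llm4szz(fix: Commit, candidates: list[Commit]) -> Commit:
    # The mapping from "LLM comprehends the bug" to a specific branch is simplified.
    if comprehends_bug(fix):
        return context_enhanced_identification(fix, candidates)
    return rank_based_identification(fix, candidates)
```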
Related papers
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers [16.80818230868491]
This study evaluates large language models (LLMs) in generating code from algorithm descriptions from recent NLP papers.
To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024.
Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions.
arXiv Detail & Related papers (2025-03-31T22:02:24Z) - WIA-SZZ: Work Item Aware SZZ [3.7232697932311645]
Existing SZZ algorithms identify the potential commit that induced a bug when given a fixing commit as input.
We build a new variant of SZZ that leverages our work item detecting commits to first suggest bug-inducing commits.
Our evaluation reveals it is 64% accurate in finding work items, but, most importantly, it is able to find many bug-inducing commits.
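For readers unfamiliar with the procedure these variants build on, here is a minimal sketch of the classic SZZ idea: blame the lines a bug-fixing commit deletes to find the commits that last touched them. It assumes a local git checkout and plain `git` subprocess calls, and omits SZZ's refinements (e.g., filtering cosmetic changes or ghost commits).

```python
import subprocess

def deleted_lines(repo: str, fix_sha: str) -> dict[str, list[int]]:
    """Return {file: [line numbers deleted by the fixing commit]} in old-file numbering."""
    diff = subprocess.run(
        ["git", "-C", repo, "diff", "-U0", f"{fix_sha}^", fix_sha],
        capture_output=True, text=True, check=True
    ).stdout
    result, current_file = {}, None
    for line in diff.splitlines():
        if line.startswith("--- "):
            path = line[4:]
            current_file = path[2:] if path.startswith("a/") else None
        elif line.startswith("@@") and current_file:
            # Hunk header: @@ -old_start,old_count +new_start,new_count @@
            old = line.split()[1].lstrip("-")
            start, _, count = old.partition(",")
            n = int(count) if count else 1
            result.setdefault(current_file, []).extend(range(int(start), int(start) + n))
    return result

def szz_candidates(repo: str, fix_sha: str) -> set[str]:
    """Blame each deleted line in the parent of the fix to collect candidate inducers."""
    candidates = set()
    for path, lines in deleted_lines(repo, fix_sha).items():
        for lineno in lines:
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "-L", f"{lineno},{lineno}",
                 f"{fix_sha}^", "--", path],
                capture_output=True, text=True
            ).stdout
            if blame:
                candidates.add(blame.split()[0].lstrip("^"))
    return candidates
```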
arXiv Detail & Related papers (2024-11-19T18:59:14Z) - Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping [60.458273797431836]
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models.
We find that this approach does not work well on non-English tasks.
Inspired by previous interpretability work on language transition during the model's forward pass, we propose an improved contrastive decoding algorithm.
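As a rough illustration of the layer-contrastive decoding mechanism this line of work builds on, the sketch below contrasts next-token logits from the final layer of a Hugging Face causal LM against those from an earlier layer. The layer choice, the fixed `alpha`, and the assumption that the model exposes an `lm_head` are simplifications, not the paper's actual algorithm.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def contrastive_next_token(model, tokenizer, prompt: str,
                           early_layer: int = 16, alpha: float = 1.0) -> str:
    """Pick the next token by final-layer log-probs minus early-layer log-probs (simplified)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Project both the final and an earlier hidden state through the LM head.
    # (Real implementations also apply the final norm to the early layer.)
    final_logits = model.lm_head(out.hidden_states[-1][:, -1, :])
    early_logits = model.lm_head(out.hidden_states[early_layer][:, -1, :])
    contrast = final_logits.log_softmax(-1) - alpha * early_logits.log_softmax(-1)
    return tokenizer.decode(contrast.argmax(-1))

# Usage (model name is just an example):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# print(contrastive_next_token(lm, tok, "The capital of France is", early_layer=6))
```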
arXiv Detail & Related papers (2024-07-15T15:14:01Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
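The self-critique loop described above can be pictured roughly as follows; this is an illustrative sketch that uses a hypothetical `llm_chat` helper and Python's own `compile()` as the feedback source, not the paper's pipeline.

```python
def llm_chat(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def compile_feedback(code: str) -> str | None:
    """Return a compiler/parser error message, or None if the code compiles."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as e:
        return f"{e.msg} at line {e.lineno}"

def iterative_self_critique(task: str, max_rounds: int = 3) -> str:
    code = llm_chat(f"Write a Python solution for:\n{task}")
    for _ in range(max_rounds):
        error = compile_feedback(code)
        if error is None:
            break
        reply = llm_chat(
            f"Task:\n{task}\n\nYour code:\n{code}\n\n"
            f"Compiler feedback: {error}\n"
            "Critique the code (what kind of bug is this?), then output a corrected version."
        )
        code = reply  # in practice, extract the code block from the reply
    return code
```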
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing [6.042114639413868]
Specialized fuzzers can handle complex structured data, but require additional efforts in grammar and suffer from low throughput.
In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data.
Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLMs to understand and mutate structured data for fuzzing.
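A bare-bones picture of LLM-assisted mutation for structured inputs might look like the following; the JSON seed format, prompt wording, and `llm_chat`/`target` helpers are placeholders, and a real greybox fuzzer would also track coverage feedback.

```python
import json
import random

def llm_chat(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def llm_mutate(seed: bytes) -> bytes:
    """Ask the LLM for a structure-preserving mutation of a structured seed."""
    reply = llm_chat(
        "Here is a JSON test input. Produce a mutated variant that stays valid JSON "
        "but changes values, nesting, or field order:\n" + seed.decode(errors="replace")
    )
    try:
        json.loads(reply)          # keep only mutations that preserve the structure
        return reply.encode()
    except json.JSONDecodeError:
        return seed                # fall back to the original seed

def fuzz_once(target, corpus: list[bytes]) -> None:
    """One fuzzing iteration: mutate a seed and keep it if the target finds it interesting."""
    seed = random.choice(corpus)
    candidate = llm_mutate(seed)
    if target(candidate):          # assume target() reports "interesting" behaviour
        corpus.append(candidate)
```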
arXiv Detail & Related papers (2024-06-11T20:48:28Z) - Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models [24.085614720512744]
This study shows that large language models (LLMs) are vulnerable to changes in the number and arrangement of options in text classification.
The key bottleneck arises from ambiguous decision boundaries and inherent biases towards specific tokens and positions.
Our approach is grounded in the empirical observation that pairwise comparisons can effectively alleviate boundary ambiguity and inherent bias.
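To make the pairwise idea concrete, here is a minimal sketch in which the label is chosen through repeated two-way LLM comparisons instead of one long multi-option prompt; the prompt wording and the `llm_chat` helper are illustrative placeholders, not the paper's method.

```python
def llm_chat(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def pairwise_pick(text: str, label_a: str, label_b: str) -> str:
    """Ask the LLM to choose between exactly two labels, sidestepping long option lists."""
    answer = llm_chat(
        f"Text: {text}\n"
        f"Which label fits better: (A) {label_a} or (B) {label_b}? Answer A or B."
    )
    return label_a if answer.strip().upper().startswith("A") else label_b

def classify_by_pairwise_tournament(text: str, labels: list[str]) -> str:
    """Sequential knock-out over labels: the current winner meets each new challenger."""
    winner = labels[0]
    for challenger in labels[1:]:
        winner = pairwise_pick(text, winner, challenger)
    return winner
```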
arXiv Detail & Related papers (2024-06-11T06:53:19Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
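As an illustration of the Unicode-based variant, the sketch below keeps only tokens whose characters fall in a target script (plus ASCII) and slices the input embedding matrix accordingly. The regex, the example script (Hangul), and the handling of the output head are simplifications; a complete trim must also remap the tokenizer and tied output embeddings.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Keep tokens made of ASCII plus a target script (Hangul chosen as an example).
KEEP = re.compile(r"^[\x00-\x7F\uAC00-\uD7A3\u1100-\u11FF]*$")

def trim_vocabulary(model_name: str):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    keep_ids = [i for t, i in tok.get_vocab().items()
                if KEEP.match(tok.convert_tokens_to_string([t]))]
    keep_ids = sorted(set(keep_ids) | set(tok.all_special_ids))
    # Slice the input embedding matrix down to the retained rows.
    emb = model.get_input_embeddings().weight.data[keep_ids].clone()
    new_emb = torch.nn.Embedding(len(keep_ids), emb.shape[1])
    new_emb.weight.data.copy_(emb)
    model.set_input_embeddings(new_emb)
    # NOTE: output head, config.vocab_size, and the tokenizer's own tables
    # must be remapped consistently as well; omitted here for brevity.
    return model, keep_ids
```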
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel [8.698309437598944]
The evaluation of how ghost commits impact the SZZ algorithm remains limited.
Linux kernel developers have started labelling bug-fixing patches with the commit identifiers of the corresponding bug-inducing commit(s) as a standard practice.
In this paper, we apply six SZZ algorithms to 76,046 pairs of bug-fixing patches and bug-inducing commits from the Linux kernel.
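The labelling practice mentioned above most likely refers to the kernel's `Fixes:` trailers, which name the inducing commit directly in the fixing patch's message. A small sketch of extracting such fix/inducer pairs from a local kernel checkout is shown below; treat the regex and revision range as approximate, since trailer formatting varies and the paper's dataset construction may differ.

```python
import re
import subprocess

# Typical trailer: Fixes: 1234567890ab ("subsystem: original commit subject")
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})", re.IGNORECASE | re.MULTILINE)

def fix_inducer_pairs(repo: str, rev_range: str = "v6.0..v6.1") -> list[tuple[str, str]]:
    """Return (fixing_sha, inducing_sha_prefix) pairs from commit messages in a range."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H%x00%B%x01", rev_range],
        capture_output=True, text=True, check=True
    ).stdout
    pairs = []
    for entry in log.split("\x01"):
        if not entry.strip():
            continue
        sha, _, body = entry.partition("\x00")
        for inducer in FIXES_RE.findall(body):
            pairs.append((sha.strip(), inducer))
    return pairs
```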
arXiv Detail & Related papers (2023-08-09T16:41:27Z) - GRACE: Discriminator-Guided Chain-of-Thought Reasoning [75.35436025709049]
We propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE) to steer the decoding process towards producing correct reasoning steps.
GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates.
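The decoding-time use of the discriminator can be pictured as re-ranking sampled next steps; the sketch below assumes hypothetical `sample_steps` and `discriminator_score` functions and a simple greedy selection rather than GRACE's exact search procedure.

```python
def sample_steps(prompt: str, k: int) -> list[str]:
    """Sample k candidate next reasoning steps from the base LM (placeholder)."""
    raise NotImplementedError

def discriminator_score(question: str, partial_solution: str, step: str) -> float:
    """Score a candidate step's correctness with a trained discriminator (placeholder)."""
    raise NotImplementedError

def discriminator_guided_decode(question: str, max_steps: int = 8, k: int = 5) -> str:
    solution = ""
    for _ in range(max_steps):
        candidates = sample_steps(question + "\n" + solution, k)
        best = max(candidates, key=lambda s: discriminator_score(question, solution, s))
        solution += best + "\n"
        if "final answer" in best.lower():   # crude stopping heuristic
            break
    return solution
```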
arXiv Detail & Related papers (2023-05-24T09:16:51Z) - ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle
Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
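The oracle-guided loop can be illustrated roughly as follows: an LLM-produced brute-force reference is run against the candidate program on small random inputs, and any disagreement is fed back for regeneration. The helper names, the assumption that both programs define `solve(x)`, and the naive `exec`-based runner are placeholders, not the paper's interface.

```python
import random

def llm_generate_solution(problem: str, feedback: str = "") -> str:
    raise NotImplementedError("plug in an LLM client")

def llm_generate_oracle(problem: str) -> str:
    """Ask the LLM for a slow but obviously-correct brute-force reference."""
    raise NotImplementedError("plug in an LLM client")

def run(code: str, x: int) -> int:
    """Execute a program that defines solve(x) on input x (sandboxing omitted)."""
    env: dict = {}
    exec(code, env)
    return env["solve"](x)

def oracle_guided_synthesis(problem: str, rounds: int = 3, trials: int = 50) -> str:
    oracle = llm_generate_oracle(problem)
    feedback = ""
    for _ in range(rounds):
        candidate = llm_generate_solution(problem, feedback)
        mismatches = [x for x in (random.randint(0, 100) for _ in range(trials))
                      if run(candidate, x) != run(oracle, x)]
        if not mismatches:
            return candidate
        feedback = f"Your program disagrees with a reference on input {mismatches[0]}."
    return candidate
```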
arXiv Detail & Related papers (2023-05-24T00:10:15Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)