Related papers: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs

URL: http://arxiv.org/abs/2406.10279v2
Date: Tue, 24 Sep 2024 21:46:56 GMT
Title: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
Authors: Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, Murtuza Jadliwala
Abstract summary: Package hallucinations arise from fact-conflicting errors when generating code using Large Language Models. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages. We show that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models.
Score: 3.515912713354746
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The reliance of popular programming languages such as Python and JavaScript on centralized package repositories and open-source software, combined with the emergence of code-generating Large Language Models (LLMs), has created a new type of threat to the software supply chain: package hallucinations. These hallucinations, which arise from fact-conflicting errors when generating code using LLMs, represent a novel form of package confusion attack that poses a critical threat to the integrity of the software supply chain. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages, settings, and parameters, exploring how a diverse set of models and configurations affect the likelihood of generating erroneous package recommendations and identifying the root causes of this phenomenon. Using 16 popular LLMs for code generation and two unique prompt datasets, we generate 576,000 code samples in two programming languages that we analyze for package hallucinations. Our findings reveal that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat. To overcome this problem, we implement several hallucination mitigation strategies and show that they are able to significantly reduce the number of package hallucinations while maintaining code quality. Our experiments and findings highlight package hallucinations as a persistent and systemic phenomenon while using state-of-the-art LLMs for code generation, and a significant challenge which deserves the research community's urgent attention.

Related papers

Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges [1.397989121713806]
Large language models (LLMs) can fluently generate source code. LLMs are prone to generating hallucinations, that is, output that is incorrect, nonsensical, or unjustifiable. This survey investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs.
arXiv Detail & Related papers (2025-04-29T14:13:57Z)
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities [11.868859925111561]
Large Language Models (LLMs) have become an essential tool in the programmer's toolkit. Their tendency to hallucinate code can be used by malicious actors to introduce vulnerabilities to broad swathes of the software supply chain.
arXiv Detail & Related papers (2025-01-31T10:26:18Z)
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code [20.736888384234273]
We introduce Collu-Bench, a benchmark for predicting code hallucinations of large language models (LLMs). Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. We conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieve 22.03% to 33.15% accuracy.
arXiv Detail & Related papers (2024-10-13T20:41:47Z)
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation [33.46342144822026]
Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language model (LLM) based approaches have shown promising results and revolutionized the code generation task. Despite this promising performance, LLMs often generate content with hallucinations, especially in code generation scenarios.
arXiv Detail & Related papers (2024-09-30T17:51:15Z)
$\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding [64.00025564372095]
Large language models (LLMs) have shown remarkable capabilities in code generation. The effects of hallucinations (e.g., output noise) make it challenging for LLMs to generate high-quality code in one pass. We propose a simple and effective uncertainty-aware selective contrastive decoding.
arXiv Detail & Related papers (2024-09-09T02:07:41Z)
CodeMirage: Hallucinations in Code Generated by Large Language Models [6.063525456640463]
Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. LLMs are prone to generating hallucinations, i.e., text that sounds plausible but is incorrect. We propose CodeMirage, the first benchmark dataset for code hallucinations.
arXiv Detail & Related papers (2024-08-14T22:53:07Z)
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs. Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types. Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions. We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Code Hallucination [0.07366405857677226]
We present several types of code hallucination. We have generated such hallucinated code manually using large language models. We also present a technique, HallTrigger, to demonstrate efficient ways of generating arbitrary code hallucinations.
arXiv Detail & Related papers (2024-07-05T19:37:37Z)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages. In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details. We propose DONAPI, an automatic malicious npm package detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information. We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
Mutual Information Alleviates Hallucinations in Abstractive Summarization [73.48162198041884]
We find a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty. This finding offers a potential explanation for hallucinations: models default to favoring text with high marginal probability, when uncertain about a continuation. We propose a decoding strategy that switches to optimizing for pointwise mutual information of the source and target token--rather than purely the probability of the target token--when the model exhibits uncertainty.
arXiv Detail & Related papers (2022-10-24T13:30:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.