HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing
- URL: http://arxiv.org/abs/2509.23835v2
- Date: Sat, 04 Oct 2025 05:29:15 GMT
- Title: HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing
- Authors: Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia
- Abstract summary: Large Language Models (LLMs) are widely used for code generation, but they face critical security risks when applied to practical production. It is critical to test LLMs for package hallucinations in order to mitigate them and defend against potential attacks. We propose HFUZZER, a novel phrase-based fuzzing framework to test LLMs for package hallucinations.
- Score: 8.667234284704655
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) are widely used for code generation, but they face critical security risks when applied to practical production due to package hallucinations, in which LLMs recommend non-existent packages. These hallucinations can be exploited in software supply chain attacks, where malicious actors register harmful packages under the hallucinated names. Testing LLMs for package hallucinations is therefore critical to mitigate them and defend against such attacks. Although researchers have proposed testing frameworks for fact-conflicting hallucinations in natural language generation, there is a lack of research on package hallucinations. To fill this gap, we propose HFUZZER, a novel phrase-based fuzzing framework to test LLMs for package hallucinations. HFUZZER adopts fuzzing technology and guides the model to infer a wider range of reasonable information from phrases, thereby generating sufficient and diverse coding tasks. Furthermore, HFUZZER extracts phrases from package information or coding tasks to ensure the relevance of phrases and code, thereby improving the relevance of the generated tasks and code. We evaluate HFUZZER on multiple LLMs and find that it triggers package hallucinations across all selected models. Compared to a mutational fuzzing framework, HFUZZER identifies 2.60x more unique hallucinated packages and generates more diverse tasks. Additionally, when testing GPT-4o, HFUZZER finds 46 unique hallucinated packages. Further analysis of GPT-4o reveals that LLMs exhibit package hallucinations not only during code generation but also when assisting with environment configuration.
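The abstract describes a loop that turns seed phrases into coding tasks, prompts the model for code, and checks whether the packages the code relies on actually exist in a registry. As a rough, hedged illustration of that idea (a sketch, not the authors' HFUZZER implementation), the Python snippet below assumes a hypothetical `query_llm` callable wrapping the model under test and uses the public PyPI JSON API as the package registry; note that import names only approximate distribution names (e.g., `cv2` vs. `opencv-python`).

```python
# Minimal sketch of phrase-based fuzzing for package hallucinations.
# Assumptions: `query_llm` is a hypothetical callable for the model under
# test; PyPI stands in for the package registry; Python >= 3.10 is needed
# for sys.stdlib_module_names. This is NOT the HFUZZER implementation.
import re
import sys
import urllib.error
import urllib.request


def package_exists_on_pypi(name: str) -> bool:
    """Return True if `name` resolves as a distribution on PyPI."""
    try:
        with urllib.request.urlopen(
            f"https://pypi.org/pypi/{name}/json", timeout=10
        ) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # 404 or network failure: treat as non-existent


def extract_top_level_imports(code: str) -> set[str]:
    """Collect top-level module names from import statements."""
    pattern = r"^\s*(?:import|from)\s+([A-Za-z_]\w*)"
    names = {m.group(1) for m in re.finditer(pattern, code, re.MULTILINE)}
    return names - set(sys.stdlib_module_names)  # drop standard library


def fuzz_once(phrase: str, query_llm) -> set[str]:
    """One fuzzing iteration: phrase -> coding task -> code -> registry check."""
    task = query_llm(
        f"Write a short, realistic Python coding task involving: {phrase}"
    )
    code = query_llm(f"Write Python code that solves this task:\n{task}")
    return {
        pkg
        for pkg in extract_top_level_imports(code)
        if not package_exists_on_pypi(pkg)
    }
```

In HFUZZER's terms, new phrases would then be extracted from package information or from the generated tasks and fed back as seeds, keeping tasks relevant to code while diversifying them; any non-empty result from `fuzz_once` is a candidate hallucinated package worth manual confirmation.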
Related papers
- Secure or Suspect? Investigating Package Hallucinations of Shell Command in Original and Quantized LLMs [7.21976012124109]
We conduct the first systematic empirical study of the impact of quantization on package hallucination and vulnerability risks in Go packages. Our results show that quantization substantially increases the package hallucination rate (PHR), with 4-bit models exhibiting the most severe degradation. Our analysis of hallucinated outputs reveals that most fabricated packages resemble realistic URL-based Go module paths.
arXiv Detail & Related papers (2025-12-09T03:47:31Z) - A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI [54.34738767990601]
As Large Language Models become increasingly integrated into software engineering tasks, understanding and mitigating hallucination in code becomes essential. We provide a systematic review of hallucination phenomena in code-oriented LLMs from four key perspectives.
arXiv Detail & Related papers (2025-11-02T02:58:41Z) - Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z) - Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities [11.868859925111561]
Large Language Models (LLMs) have become an essential tool in the programmer's toolkit. Their tendency to hallucinate code can be used by malicious actors to introduce vulnerabilities to broad swathes of the software supply chain.
arXiv Detail & Related papers (2025-01-31T10:26:18Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - CodeMirage: Hallucinations in Code Generated by Large Language Models [6.063525456640463]
Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. We propose CodeMirage, the first benchmark dataset for code hallucinations.
arXiv Detail & Related papers (2024-08-14T22:53:07Z) - Code Hallucination [0.07366405857677226]
We present several types of code hallucination.
We have generated such hallucinated code manually using large language models.
We also present a technique, HallTrigger, to demonstrate efficient ways of generating arbitrary code hallucinations.
arXiv Detail & Related papers (2024-07-05T19:37:37Z) - We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [3.515912713354746]
Package hallucinations arise from fact-conflicting errors when generating code using Large Language Models. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages. We show that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models.
arXiv Detail & Related papers (2024-06-12T03:29:06Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [40.930238150365795]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback. We generate a small-size hallucination annotation dataset using proprietary models. Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge.
To understand what types of content and to what extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
arXiv Detail & Related papers (2023-05-19T15:36:27Z)