CodeMirage: Hallucinations in Code Generated by Large Language Models
- URL: http://arxiv.org/abs/2408.08333v1
- Date: Wed, 14 Aug 2024 22:53:07 GMT
- Title: CodeMirage: Hallucinations in Code Generated by Large Language Models
- Authors: Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu,
- Abstract summary: Large Language Models (LLMs) have shown promising potentials in program generation and no-code automation.
LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect.
We propose the first benchmark CodeMirage dataset for code hallucinations.
- Score: 6.063525456640463
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have shown promising potentials in program generation and no-code automation. However, LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, similar hallucination phenomenon can happen in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adaptation of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5 generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose the methodology for code hallucination detection and experiment with open source LLMs such as CodeLLaMA as well as OpenAI's GPT-3.5 and GPT-4 models using one-shot prompt. We find that GPT-4 performs the best on HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.
Related papers
- ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries [29.561699707926056]
Large language models (LLMs) are prone to hallucination-outputs that stray from intended meanings.
We introduce a first-of-its-kind dataset with $sim$10K samples, curated specifically for hallucination detection in code summarization.
arXiv Detail & Related papers (2024-10-17T19:38:55Z) - MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo)
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code [20.736888384234273]
We introduce Collu-Bench, a benchmark for predicting code hallucinations of large language models (LLMs)
Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones.
We conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieves 22.03 -- 33.15% accuracy.
arXiv Detail & Related papers (2024-10-13T20:41:47Z) - Code Hallucination [0.07366405857677226]
We present several types of code hallucination.
We have generated such hallucinated code manually using large language models.
We also present a technique - HallTrigger, in order to demonstrate efficient ways of generating arbitrary code hallucination.
arXiv Detail & Related papers (2024-07-05T19:37:37Z) - Data-augmented phrase-level alignment for mitigating object hallucination [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification.
We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations.
We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z) - Exploring and Evaluating Hallucinations in LLM-Powered Code Generation [14.438161741833687]
Large Language Models (LLMs) produce outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge.
Existing work mainly focuses on investing the hallucination in the domain of natural language generation.
We conduct a thematic analysis of the LLM-generated code to summarize and categorize the hallucinations present in it.
We propose HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.
arXiv Detail & Related papers (2024-04-01T07:31:45Z) - The Dawn After the Dark: An Empirical Study on Factuality Hallucination
in Large Language Models [134.6697160940223]
hallucination poses great challenge to trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them.
This work presents a systematic empirical study on LLM hallucination, focused on the the three aspects of hallucination detection, source and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z) - Alleviating Hallucinations of Large Language Models through Induced
Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple textitInduce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large
Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge.
To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval)
arXiv Detail & Related papers (2023-05-19T15:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.