De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
- URL: http://arxiv.org/abs/2401.01701v3
- Date: Wed, 19 Jun 2024 07:54:15 GMT
- Title: De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
- Authors: Aryaz Eghbali, Michael Pradel
- Abstract summary: Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks.
LLMs are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs.
This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt.
- Score: 18.129031749321058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. However, these models are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. Instead, LLMs often invent, or "hallucinate", non-existent APIs or produce variants of already existing code. This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt. The approach exploits the observation that predictions by LLMs often resemble the desired code, but they fail to correctly refer to already existing APIs. De-Hallucinator automatically identifies project-specific API references related to the model's initial predictions and adds these references into the prompt. Unlike retrieval-augmented generation (RAG), our approach uses the initial prediction(s) by the model to iteratively retrieve increasingly suitable API references. Our evaluation applies the approach to two tasks: predicting API usages in Python and generating tests in JavaScript. We show that De-Hallucinator consistently improves the generated code across five LLMs. In particular, the approach improves the edit distance by 23.3-50.6% and the recall of correctly predicted API usages by 23.9-61.0% for code completion, and improves the number of fixed tests that initially failed because of hallucinations by 63.2%, resulting in a 15.5% increase in statement coverage for test generation.
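The abstract describes an iterative loop: query the model, retrieve project-specific API references related to the initial prediction, add them to the prompt, and query again. Below is a minimal sketch of such a loop; query_llm, retrieve_api_references, and project_index are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the iterative grounding loop described in the abstract.
# query_llm and retrieve_api_references are hypothetical callables supplied by
# the caller; this is an illustration, not the authors' implementation.
def de_hallucinate(task_prompt, project_index, query_llm,
                   retrieve_api_references, max_iterations=3, top_k=5):
    prediction = query_llm(task_prompt)  # initial, possibly hallucinating prediction
    for _ in range(max_iterations):
        # Find project-specific API references that resemble APIs in the prediction.
        references = retrieve_api_references(project_index, prediction, top_k=top_k)
        if not references:
            break
        # Re-query with the retrieved references prepended as context.
        grounded_prompt = "\n".join(references) + "\n\n" + task_prompt
        new_prediction = query_llm(grounded_prompt)
        if new_prediction == prediction:
            break  # converged; further iterations are unlikely to change the output
        prediction = new_prediction
    return prediction
```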
Related papers
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo).
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study [13.633501449498402]
For large language models (LLMs) pre-trained or fine-tuned on large code corpora, code completion may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries.
This study involved seven advanced LLMs, 145 API mappings from eight popular Python libraries, and 28,125 completion prompts.
We propose two lightweight fixing approaches, ReplaceAPI and InsertPrompt, which can serve as baseline approaches for future research.
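A rough illustration of what a replace-style fix for deprecated APIs could look like, assuming a simple name-level mapping. The mappings below are illustrative examples (deprecated unittest aliases), not the ones studied in the paper, and the paper's actual ReplaceAPI/InsertPrompt implementations may differ.

```python
# Illustrative name-level fix for deprecated APIs in a generated completion.
# The mapping holds example deprecated-to-current pairs (unittest aliases);
# the study itself uses 145 mappings from eight Python libraries.
DEPRECATED_TO_CURRENT = {
    "assertEquals": "assertEqual",  # deprecated unittest alias
    "failUnless": "assertTrue",     # deprecated unittest alias
}

def replace_deprecated_apis(completion: str, mapping=DEPRECATED_TO_CURRENT) -> str:
    for deprecated, current in mapping.items():
        completion = completion.replace(deprecated, current)
    return completion
```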
arXiv Detail & Related papers (2024-06-14T08:44:10Z) - CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification.
We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations.
We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
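Execution-based verification, as summarized here, amounts to running generated code and treating certain runtime failures (such as references to names or attributes that do not exist) as evidence of hallucination. The following is a simplified sketch of that idea; it is not the CodeHalu algorithm and performs no real sandboxing.

```python
# Simplified sketch of execution-based hallucination detection: execute a
# generated snippet and flag failures that typically indicate invented APIs.
# Not the CodeHalu algorithm; exec() here runs code without sandboxing.
HALLUCINATION_ERRORS = (AttributeError, ImportError, NameError)

def looks_hallucinated(generated_code: str) -> bool:
    try:
        exec(generated_code, {"__name__": "__sandbox__"})
    except HALLUCINATION_ERRORS:
        return True   # the code refers to something that does not exist
    except Exception:
        return False  # failed for other reasons; not classified as hallucination
    return False
```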
arXiv Detail & Related papers (2024-04-30T23:56:38Z) - Citation-Enhanced Generation for LLM-based Chatbots [11.973280288131225]
Large language models (LLMs) exhibit powerful general intelligence across diverse scenarios.
They may produce hallucinated content in responses, which significantly limits their applicability.
We propose a novel post-hoc Citation-Enhanced Generation approach combined with retrieval augmentation.
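Post-hoc citation, in contrast to standard RAG, generates the answer first and only then retrieves documents to support its statements. A bare-bones sketch under that reading; retrieve is a hypothetical search function returning (doc_id, score) pairs, and the sentence splitting is intentionally naive.

```python
# Bare-bones sketch of post-hoc citation: generate first, then attach a citation
# to each sentence by retrieving supporting documents. `retrieve` is a
# hypothetical search function returning a ranked list of (doc_id, score) pairs.
def add_citations(response: str, retrieve, min_score: float = 0.5) -> str:
    cited = []
    for sentence in filter(None, response.split(". ")):  # naive sentence splitting
        hits = retrieve(sentence)
        if hits and hits[0][1] >= min_score:
            sentence = f"{sentence} [{hits[0][0]}]"
        cited.append(sentence)
    return ". ".join(cited)
```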
arXiv Detail & Related papers (2024-02-25T11:24:41Z) - Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations but also improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs [8.403074015356594]
Large Language Models (LLMs) are increasingly integrated into software applications.
However, LLM APIs are often updated silently and scheduled for deprecation.
This can cause performance regression and affect prompt design choices.
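One practical response to silent model updates is to keep a small suite of prompts with checkable expectations and re-run it whenever the provider rolls out a new version. The sketch below illustrates that idea; call_llm is a hypothetical client function and the prompts are made-up examples, not taken from the paper.

```python
# Illustrative prompt regression suite: fixed prompts with simple output checks,
# re-run against each model version. `call_llm` is a hypothetical client.
PROMPT_SUITE = [
    # (prompt, predicate over the model output) -- made-up examples
    ("Reply with exactly the word OK.", lambda out: out.strip() == "OK"),
    ("Return a JSON list of three primes.", lambda out: out.strip().startswith("[")),
]

def run_regression_suite(call_llm, model_version: str) -> list[str]:
    failures = []
    for prompt, check in PROMPT_SUITE:
        if not check(call_llm(prompt, model=model_version)):
            failures.append(prompt)
    return failures  # a non-empty result signals a regression after a model update
```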
arXiv Detail & Related papers (2023-11-18T17:11:12Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
The capabilities of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation that involves private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
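A compact sketch of the iterative query-expansion idea described above: ask the LLM for queries related to the question, retrieve with them, and repeat. query_llm and retrieve are hypothetical helpers, and the beam-search scoring used by ALLIES is omitted.

```python
# Compact sketch of iterative query expansion: ask the LLM for related queries,
# retrieve evidence with them, and repeat. `query_llm` and `retrieve` are
# hypothetical helpers; ALLIES' beam-search scoring is omitted here.
def expand_and_retrieve(question, query_llm, retrieve, rounds=2, per_round=3):
    queries = [question]
    evidence = []
    for _ in range(rounds):
        new_queries = []
        for q in queries:
            prompt = f"Generate {per_round} short search queries related to: {q}"
            new_queries += query_llm(prompt).splitlines()[:per_round]
        evidence += [doc for q in new_queries for doc in retrieve(q)]
        queries = new_queries  # widen the scope in the next round
    return evidence
```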
arXiv Detail & Related papers (2023-05-24T06:16:44Z) - On the Effectiveness of Pretrained Models for API Learning [8.788509467038743]
Developers frequently use APIs to implement certain functionalities, such as parsing Excel files, reading and writing text files line by line, etc.
Developers can greatly benefit from automatic API usage sequence generation based on natural language queries for building applications in a faster and cleaner manner.
Existing approaches utilize information retrieval models to search for matching API sequences given a query or use RNN-based encoder-decoder to generate API sequences.
arXiv Detail & Related papers (2022-04-05T20:33:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.