De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
- URL: http://arxiv.org/abs/2401.01701v3
- Date: Wed, 19 Jun 2024 07:54:15 GMT
- Title: De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
- Authors: Aryaz Eghbali, Michael Pradel
- Abstract summary: Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks.
LLMs are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs.
This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt.
- Score: 18.129031749321058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. However, these models are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. Instead, LLMs often invent, or "hallucinate", non-existent APIs or produce variants of already existing code. This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt. The approach exploits the observation that predictions by LLMs often resemble the desired code, but they fail to correctly refer to already existing APIs. De-Hallucinator automatically identifies project-specific API references related to the model's initial predictions and adds these references into the prompt. Unlike retrieval-augmented generation (RAG), our approach uses the initial prediction(s) by the model to iteratively retrieve increasingly suitable API references. Our evaluation applies the approach to two tasks: predicting API usages in Python and generating tests in JavaScript. We show that De-Hallucinator consistently improves the generated code across five LLMs. In particular, the approach improves the edit distance by 23.3-50.6% and the recall of correctly predicted API usages by 23.9-61.0% for code completion, and improves the number of fixed tests that initially failed because of hallucinations by 63.2%, resulting in a 15.5% increase in statement coverage for test generation.
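The abstract describes an iterative loop: query the model, retrieve project-specific API references related to the initial prediction, add them to the prompt, and query again. Below is a minimal sketch of such a loop; query_llm, retrieve_api_references, and project_index are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the iterative grounding loop described in the abstract.
# query_llm and retrieve_api_references are hypothetical callables supplied by
# the caller; this is an illustration, not the authors' implementation.
def de_hallucinate(task_prompt, project_index, query_llm,
                   retrieve_api_references, max_iterations=3, top_k=5):
    prediction = query_llm(task_prompt)  # initial, possibly hallucinating prediction
    for _ in range(max_iterations):
        # Find project-specific API references that resemble APIs in the prediction.
        references = retrieve_api_references(project_index, prediction, top_k=top_k)
        if not references:
            break
        # Re-query with the retrieved references prepended as context.
        grounded_prompt = "\n".join(references) + "\n\n" + task_prompt
        new_prediction = query_llm(grounded_prompt)
        if new_prediction == prediction:
            break  # converged; further iterations are unlikely to change the output
        prediction = new_prediction
    return prediction
```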
Related papers
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo).
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study [13.633501449498402]
For large language models (LLMs) pre-trained or fine-tuned on large code corpora, code completion may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries.
This study involved seven advanced LLMs, 145 API mappings from eight popular Python libraries, and 28,125 completion prompts.
We propose two lightweight fixing approaches, ReplaceAPI and InsertPrompt, which can serve as baseline approaches for future research.
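A rough illustration of what a replace-style fix for deprecated APIs could look like, assuming a simple name-level mapping. The mappings below are illustrative examples (deprecated unittest aliases), not the ones studied in the paper, and the paper's actual ReplaceAPI/InsertPrompt implementations may differ.

```python
# Illustrative name-level fix for deprecated APIs in a generated completion.
# The mapping holds example deprecated-to-current pairs (unittest aliases);
# the study itself uses 145 mappings from eight Python libraries.
DEPRECATED_TO_CURRENT = {
    "assertEquals": "assertEqual",  # deprecated unittest alias
    "failUnless": "assertTrue",     # deprecated unittest alias
}

def replace_deprecated_apis(completion: str, mapping=DEPRECATED_TO_CURRENT) -> str:
    for deprecated, current in mapping.items():
        completion = completion.replace(deprecated, current)
    return completion
```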
arXiv Detail & Related papers (2024-06-14T08:44:10Z) - CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification.
We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations.
We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
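Execution-based verification, as summarized here, amounts to running generated code and treating certain runtime failures (such as references to names or attributes that do not exist) as evidence of hallucination. The following is a simplified sketch of that idea; it is not the CodeHalu algorithm and performs no real sandboxing.

```python
# Simplified sketch of execution-based hallucination detection: execute a
# generated snippet and flag failures that typically indicate invented APIs.
# Not the CodeHalu algorithm; exec() here runs code without sandboxing.
HALLUCINATION_ERRORS = (AttributeError, ImportError, NameError)

def looks_hallucinated(generated_code: str) -> bool:
    try:
        exec(generated_code, {"__name__": "__sandbox__"})
    except HALLUCINATION_ERRORS:
        return True   # the code refers to something that does not exist
    except Exception:
        return False  # failed for other reasons; not classified as hallucination
    return False
```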
arXiv Detail & Related papers (2024-04-30T23:56:38Z) - Citation-Enhanced Generation for LLM-based Chatbots [11.973280288131225]
Large language models (LLMs) exhibit powerful general intelligence across diverse scenarios.
They may produce hallucinated content in responses, which significantly limits their applicability.
We propose a novel post-hoc Citation-Enhanced Generation approach combined with retrieval augmentation.
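Post-hoc citation, in contrast to standard RAG, generates the answer first and only then retrieves documents to support its statements. A bare-bones sketch under that reading; retrieve is a hypothetical search function returning (doc_id, score) pairs, and the sentence splitting is intentionally naive.

```python
# Bare-bones sketch of post-hoc citation: generate first, then attach a citation
# to each sentence by retrieving supporting documents. `retrieve` is a
# hypothetical search function returning a ranked list of (doc_id, score) pairs.
def add_citations(response: str, retrieve, min_score: float = 0.5) -> str:
    cited = []
    for sentence in filter(None, response.split(". ")):  # naive sentence splitting
        hits = retrieve(sentence)
        if hits and hits[0][1] >= min_score:
            sentence = f"{sentence} [{hits[0][0]}]"
        cited.append(sentence)
    return ". ".join(cited)
```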
arXiv Detail & Related papers (2024-02-25T11:24:41Z) - Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations but also improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs [8.403074015356594]
Large Language Models (LLMs) are increasingly integrated into software applications.
However, LLM APIs are often updated silently and scheduled for deprecation.
This can cause performance regression and affect prompt design choices.
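One practical response to silent model updates is to keep a small suite of prompts with checkable expectations and re-run it whenever the provider rolls out a new version. The sketch below illustrates that idea; call_llm is a hypothetical client function and the prompts are made-up examples, not taken from the paper.

```python
# Illustrative prompt regression suite: fixed prompts with simple output checks,
# re-run against each model version. `call_llm` is a hypothetical client.
PROMPT_SUITE = [
    # (prompt, predicate over the model output) -- made-up examples
    ("Reply with exactly the word OK.", lambda out: out.strip() == "OK"),
    ("Return a JSON list of three primes.", lambda out: out.strip().startswith("[")),
]

def run_regression_suite(call_llm, model_version: str) -> list[str]:
    failures = []
    for prompt, check in PROMPT_SUITE:
        if not check(call_llm(prompt, model=model_version)):
            failures.append(prompt)
    return failures  # a non-empty result signals a regression after a model update
```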
arXiv Detail & Related papers (2023-11-18T17:11:12Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
The capabilities of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation that involves private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
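A compact sketch of the iterative query-expansion idea described above: ask the LLM for queries related to the question, retrieve with them, and repeat. query_llm and retrieve are hypothetical helpers, and the beam-search scoring used by ALLIES is omitted.

```python
# Compact sketch of iterative query expansion: ask the LLM for related queries,
# retrieve evidence with them, and repeat. `query_llm` and `retrieve` are
# hypothetical helpers; ALLIES' beam-search scoring is omitted here.
def expand_and_retrieve(question, query_llm, retrieve, rounds=2, per_round=3):
    queries = [question]
    evidence = []
    for _ in range(rounds):
        new_queries = []
        for q in queries:
            prompt = f"Generate {per_round} short search queries related to: {q}"
            new_queries += query_llm(prompt).splitlines()[:per_round]
        evidence += [doc for q in new_queries for doc in retrieve(q)]
        queries = new_queries  # widen the scope in the next round
    return evidence
```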
arXiv Detail & Related papers (2023-05-24T06:16:44Z) - On the Effectiveness of Pretrained Models for API Learning [8.788509467038743]
Developers frequently use APIs to implement certain functionalities, such as parsing Excel files, reading and writing text files line by line, etc.
Developers can greatly benefit from automatic API usage sequence generation based on natural language queries for building applications in a faster and cleaner manner.
Existing approaches utilize information retrieval models to search for matching API sequences given a query or use RNN-based encoder-decoder to generate API sequences.
arXiv Detail & Related papers (2022-04-05T20:33:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.