Related papers: On Mitigating Code LLM Hallucinations with API Documentation

On Mitigating Code LLM Hallucinations with API Documentation

URL: http://arxiv.org/abs/2407.09726v1
Date: Sat, 13 Jul 2024 00:16:26 GMT
Title: On Mitigating Code LLM Hallucinations with API Documentation
Authors: Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, Varun Kumar,
Abstract summary: We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. We demonstrate that our proposed methods enhance the balance between low and high frequency API performance.
Score: 22.933186524255593
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: for e.g., GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs (increase to 47.94% with DAG) but negatively impacts high frequency APIs when using sub-optimal retrievers (a 39.02% absolute drop). To mitigate this, we propose to intelligently trigger DAG where we check against an API index or leverage Code LLMs' confidence scores to retrieve only when needed. We demonstrate that our proposed methods enhance the balance between low and high frequency API performance, resulting in more reliable API invocations (8.20% absolute improvement on CloudAPIBench for GPT-4o).

Related papers

APIRAT: Integrating Multi-source API Knowledge for Enhanced Code Translation with LLMs [6.522570957351905]
APIRAT is a novel code translation method that integrates multi-source API knowledge. APIRAT employs three API knowledge augmentation techniques, including API sequence retrieval, API sequence back-translation, and API mapping. Experiments indicate that APIRAT significantly surpasses existing LLM-based methods, achieving improvements in computational accuracy ranging from 4% to 15.1%.
arXiv Detail & Related papers (2025-04-21T04:24:49Z)
LLM-assisted Mutation for Whitebox API Testing [40.91007243855959]
MioHint is a novel white-box API testing approach that leverages the code comprehension capabilities of Large Language Model (LLM) to boost API testing. To evaluate the effectiveness of our method, we conducted experiments across 16 real-world API services.
arXiv Detail & Related papers (2025-04-08T07:14:51Z)
Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like overflows and use buffer-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solution. We show that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving an absolute increase of 11.24% over niave RAG approaches and 14.07% over pretraining methods in pass@10.
arXiv Detail & Related papers (2024-12-06T19:00:15Z)
A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs [46.65963514391019]
We present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing. We integrate Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs) Our approach treats REST API testing as a separable problem, where four agents -- API, dependency, parameter, and value -- collaborate to optimize API exploration.
arXiv Detail & Related papers (2024-11-11T16:20:27Z)
AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation [16.590226868986296]
AutoFeedback is a framework for efficient and accurate API request generation. It implements two feedback loops during the process of generating API requests by the Large Language Models. It achieves an accuracy of 100.00% on a real-world API dataset and reduces the cost of interaction with GPT-3.5 Turbo by 23.44%, and GPT-4 Turbo by 11.85%.
arXiv Detail & Related papers (2024-10-09T14:38:28Z)
SEAL: Suite for Evaluating API-use of LLMs [1.2528321519119252]
SEAL is an end-to-end testbed designed to evaluate large language models in real-world API usage. It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs.
arXiv Detail & Related papers (2024-09-23T20:16:49Z)
A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development. Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z)
FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking [57.53742155914176]
API call generation is the cornerstone of large language models' tool-using ability. Existing supervised and in-context learning approaches suffer from high training costs, poor data efficiency, and generated API calls that can be unfaithful to the API documentation and the user's request. We propose an output-side optimization approach called FANTASE to address these limitations.
arXiv Detail & Related papers (2024-07-18T23:44:02Z)
WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment [49.00213183302225]
We propose a framework to induce new APIs by grounding wikiHow instruction to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we propose a few-shot prompting to steer GPT-4.
arXiv Detail & Related papers (2024-07-10T15:52:44Z)
A Solution-based LLM API-using Methodology for Academic Information Seeking [49.096714812902576]
SoAy is a solution-based LLM API-using methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. Results show a 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines.
arXiv Detail & Related papers (2024-05-24T02:44:14Z)
Compositional API Recommendation for Library-Oriented Code Generation [23.355509276291198]
We propose CAPIR, which adopts a "divide-and-conquer" strategy to recommend APIs for coarse-grained requirements. We present two challenging benchmarks, RAPID (Recommend APIs based on Documentation) and LOCG (Library-Oriented Code Generation) Experimental results on these benchmarks, demonstrate the effectiveness of CAPIR in comparison to existing baselines.
arXiv Detail & Related papers (2024-02-29T18:27:27Z)
De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding [18.129031749321058]
Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. LLMs are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references.
arXiv Detail & Related papers (2024-01-03T12:09:43Z)
Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries. We propose a novel framework that emulates the process of programmers writing private code. We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z)
HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions [35.48276161473216]
We present HAPI, a longitudinal dataset of 1,761,417 instances of commercial ML API applications. Each instance consists of a query input for an API along with the API's output prediction/annotation and confidence scores.
arXiv Detail & Related papers (2022-09-18T01:52:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.