Generating API Parameter Security Rules with LLM for API Misuse Detection
- URL: http://arxiv.org/abs/2409.09288v2
- Date: Thu, 19 Sep 2024 04:15:40 GMT
- Title: Generating API Parameter Security Rules with LLM for API Misuse Detection
- Authors: Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin
- Abstract summary: We present a new framework, named GPTAid, for automatic APSR generation by analyzing API source code with an LLM.
We propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSRs violations.
GPTAid achieves a precision of 92.3% and generates 6 times more APSRs than state-of-the-art detectors.
- Score: 26.28337534131051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new framework, named GPTAid, for automatic APSR generation by analyzing API source code with an LLM and for detecting API misuse caused by incorrect parameter use. To validate the correctness of the LLM-generated APSRs, we propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSR violations, most of which result in runtime errors. Specifically, GPTAid first uses the LLM to generate raw APSRs and the Right calling code, and then generates Violation code for each raw APSR by modifying the Right calling code with the LLM. Subsequently, GPTAid dynamically executes each piece of Violation code and filters out incorrect APSRs based on the resulting runtime errors. To produce concrete APSRs, GPTAid then refines the filtered ones with a code differential analysis. In particular, because programming languages are more precise than natural language, GPTAid identifies the key operations within the Violation code by differential analysis and generates the corresponding concrete APSR from those operations. These concrete APSRs can be precisely interpreted into applicable detection code, which proves effective in API misuse detection. Evaluated on a dataset of 200 randomly selected APIs from eight popular libraries, GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs than state-of-the-art detectors on a comparison dataset of previously reported bugs and APSRs. We further evaluated GPTAid on 47 applications; it found 210 previously unknown security bugs that could potentially result in severe security issues (e.g., system crashes), 150 of which have been confirmed by developers after our reports.
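The execution-feedback-checking loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not GPTAid's actual implementation: the `llm()` helper, the prompt wording, and the gcc-based runner are assumptions, and the code-differential refinement step is omitted.

```python
import os
import subprocess
import tempfile


def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; any LLM API could be substituted here."""
    raise NotImplementedError


def runs_without_error(c_source: str) -> bool:
    """Compile and run a C snippet, reporting whether it exits cleanly (no runtime error)."""
    with tempfile.TemporaryDirectory() as tmp:
        src, binary = os.path.join(tmp, "snippet.c"), os.path.join(tmp, "snippet")
        with open(src, "w") as f:
            f.write(c_source)
        if subprocess.run(["gcc", src, "-o", binary]).returncode != 0:
            return False  # failed to compile; treat as not a clean run
        try:
            return subprocess.run([binary], timeout=10).returncode == 0
        except subprocess.TimeoutExpired:
            return False


def validate_apsrs(api_source: str) -> list[str]:
    """Keep only the raw APSRs whose Violation code actually crashes at runtime."""
    raw_apsrs = llm(f"List parameter security rules (APSRs) for this API:\n{api_source}").splitlines()
    right_code = llm(f"Write a correct (Right) calling example for this API:\n{api_source}")
    kept = []
    for rule in raw_apsrs:
        violation = llm(f"Modify this correct call so it violates the rule '{rule}':\n{right_code}")
        # Violating a genuine security-critical APSR is expected to cause a runtime
        # error; if the Violation code still runs cleanly, the raw rule is discarded.
        if not runs_without_error(violation):
            kept.append(rule)
    return kept
```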
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - Toward General Instruction-Following Alignment for Retrieval-Augmented Generation [63.611024451010316]
Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems.
We propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems.
arXiv Detail & Related papers (2024-10-12T16:30:51Z) - REDO: Execution-Free Runtime Error Detection for COding Agents [3.9903610503301072]
Execution-free Runtime Error Detection for COding Agents (REDO) is a method that integrates runtime error detection with static analysis tools.
We demonstrate that REDO outperforms current state-of-the-art methods, achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score.
arXiv Detail & Related papers (2024-10-10T18:06:29Z) - The Midas Touch: Triggering the Capability of LLMs for RM-API Misuse Detection [26.28337534131051]
ChatDetector fully automates documentation understanding for RM-API misuse detection.
ChatDetector identifies 165 pairs of RM-APIs with a precision of 98.21% compared with the state-of-the-art API detectors.
arXiv Detail & Related papers (2024-09-14T09:11:18Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
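The detection approach above amounts to fitting a linear probe over LLM activations. Below is a minimal scikit-learn sketch using synthetic placeholder features in place of real activation vectors; the activation-extraction and labelling pipeline from the paper is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical data: one activation vector per run, labelled 1 if the processed
# external text caused task drift and 0 otherwise. Random features stand in for
# real activations, so this snippet only demonstrates the probe's mechanics.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 4096)), rng.integers(0, 2, size=1000)
X_test, y_test = rng.normal(size=(200, 4096)), rng.integers(0, 2, size=200)

# A simple linear classifier, as the summary describes.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))
```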
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
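The re-prompting loop can be sketched as follows; this is a minimal illustration assuming a generic `llm()` callable, and the prompt wording is an assumption rather than the paper's actual prompts.

```python
from typing import Callable


def refine_with_vqs(llm: Callable[[str], str], task: str, rounds: int = 2) -> str:
    """Self-refinement via targeted Verification Questions (VQs): generate code,
    ask the model pointed questions about likely bugs, then re-prompt to repair."""
    code = llm(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        # Targeted VQs probe specific, likely bug locations in the initial code.
        vqs = llm(f"Ask targeted verification questions about potential bugs in:\n{code}")
        # Re-prompt with the VQs and the current code to repair suspected bugs.
        code = llm(
            f"Task:\n{task}\nCurrent code:\n{code}\n"
            f"Answer these questions and fix any bugs you find:\n{vqs}"
        )
    return code
```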
arXiv Detail & Related papers (2024-05-22T19:02:50Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Can Generative Large Language Models Perform ASR Error Correction? [16.246481696611117]
Generative large language models (LLMs) have been applied to a wide range of natural language processing tasks.
In this paper we investigate using ChatGPT, a generative LLM, for ASR error correction.
Experiments show that this generative LLM approach can yield performance gains for two different state-of-the-art ASR architectures.
arXiv Detail & Related papers (2023-07-09T13:38:25Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.