Related papers: Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

URL: http://arxiv.org/abs/2512.04106v1
Date: Fri, 28 Nov 2025 12:19:31 GMT
Title: Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection
Authors: Fouad Trad, Ali Chehab,
Abstract summary: Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models.<n>We examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection.
Score: 0.8737375836744933
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources.

Related papers

On Randomness in Agentic Evals [6.177270420667714]
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks.<n>Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate.<n>We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected.
arXiv Detail & Related papers (2026-02-06T19:49:13Z)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example [117.86853102104256]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs)<n>We identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement.
arXiv Detail & Related papers (2025-04-29T09:24:30Z)
Probably Approximately Precision and Recall Learning [60.00180898830079]
A key challenge in machine learning is the prevalence of one-sided feedback.<n>We introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels.<n>We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
Batch-in-Batch: a new adversarial training framework for initial perturbation and sample selection [9.241737058291823]
Adrial training methods generate independent initial perturbation for adversarial samples from a simple uniform distribution. We propose a simple yet effective training framework called Batch-in-Batch to enhance models. We show that models trained within the BB framework consistently have higher adversarial accuracy across various adversarial settings.
arXiv Detail & Related papers (2024-06-06T13:34:43Z)
Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification [2.410463233396231]
Small language models (SLMs) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks. In few-shot settings when prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieve approximately 75% accuracy with limited labeled data. In zero-shot settings with a fixed model, we underscore a pivotal observation that, although the GPT-3.5-turbo equipped with around 154B parameters garners an accuracy of 55.16%, the power of well designed prompts becomes evident.
arXiv Detail & Related papers (2023-09-26T09:24:46Z)
Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong. We propose Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning.<n>We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
(Certified!!) Adversarial Robustness for Free! [116.6052628829344]
We certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within a 2-norm of 0.5. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine tuning or retraining of model parameters.
arXiv Detail & Related papers (2022-06-21T17:27:27Z)
Efficient, Uncertainty-based Moderation of Neural Networks Text Classifiers [8.883733362171034]
We propose a framework for the efficient, in-operation moderation of classifiers' output. We suggest a semi-automated approach that uses prediction uncertainties to pass unconfident, probably incorrect classifications to human moderators. A series of benchmarking experiments show that our framework can improve the classification F1-scores by 5.1 to 11.2%.
arXiv Detail & Related papers (2022-04-04T09:07:54Z)
MIO : Mutual Information Optimization using Self-Supervised Binary Contrastive Learning [12.365801596593936]
We model our pre-training task as a binary classification problem to induce an implicit contrastive effect.<n>Unlike existing methods, the proposed loss function optimize the mutual information in positive and negative pairs.<n>The proposed method outperforms SOTA self-supervised contrastive frameworks on benchmark datasets.
arXiv Detail & Related papers (2021-11-24T17:51:29Z)
Adaptive Verifiable Training Using Pairwise Class Similarity [17.89932271240133]
Verifiable training has shown success in creating neural networks that are provably robust to a given amount of noise. However, despite enforcing a single robustness criterion, its performance scales poorly with dataset complexity. We propose a new approach that utilizes inter-class similarity to improve the performance of verifiable training.
arXiv Detail & Related papers (2020-12-14T19:10:30Z)
Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
Frustratingly Simple Few-Shot Object Detection [98.42824677627581]
We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 220 points on current benchmarks.
arXiv Detail & Related papers (2020-03-16T00:29:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.