Retrieval-Augmented Test Generation: How Far Are We?
- URL: http://arxiv.org/abs/2409.12682v1
- Date: Thu, 19 Sep 2024 11:48:29 GMT
- Title: Retrieval-Augmented Test Generation: How Far Are We?
- Authors: Jiho Shin, Reem Aleithan, Hadi Hemmati, Song Wang
- Abstract summary: Retrieval Augmented Generation (RAG) has shown notable advancements in software engineering tasks, yet its application in unit test generation remains under-explored.
To bridge this gap, we investigate the efficacy of RAG-based LLMs in test generation.
Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow Q&As.
- Score: 8.84734567720785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval Augmented Generation (RAG) has shown notable advancements in software engineering tasks. Despite its potential, RAG's application in unit test generation remains under-explored. To bridge this gap, we investigate the efficacy of RAG-based LLMs in test generation. As RAG can leverage various knowledge sources to enhance its performance, we also explore the impact of different sources of RAG's knowledge bases on unit test generation to provide insights into their practical benefits and limitations. Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow Q&As. Each source offers essential knowledge for creating tests from a different perspective: API documentation provides official API usage guidelines, GitHub issues offer resolutions of API-related issues from the library developers, and StackOverflow Q&As present community-driven solutions and best practices. For our experiment, we focus on five widely used Python-based machine learning (ML) projects, i.e., TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, which are used to build, train, and deploy complex neural networks efficiently. We conducted experiments using the top 10% most widely used APIs across these projects, involving a total of 188 APIs. We investigate the effectiveness of four state-of-the-art LLMs (open- and closed-source), i.e., GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B. Additionally, we compare three prompting strategies for generating unit test cases for the experimental APIs, i.e., zero-shot, Basic RAG, and API-level RAG over the three external sources. Finally, we compare the cost of the different knowledge sources used for RAG.
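Since the abstract contrasts zero-shot, Basic RAG, and API-level RAG prompting, a minimal sketch may help make the distinction concrete. Everything below (the toy corpus, the token-overlap retriever, and the `call_llm` stub) is an illustrative assumption, not the authors' implementation:
```python
# Minimal sketch of the three prompting strategies compared in the paper.
from dataclasses import dataclass

@dataclass
class Doc:
    source: str   # "api_doc" | "github_issue" | "stackoverflow"
    api: str      # fully qualified API under test, e.g. "torch.nn.Conv2d"
    text: str

CORPUS = [
    Doc("api_doc", "torch.nn.Conv2d", "Applies a 2D convolution over an input signal ..."),
    Doc("stackoverflow", "torch.nn.Conv2d", "Q: shape mismatch with Conv2d ... A: check in_channels ..."),
]

def retrieve_basic(query: str, k: int = 2) -> list[Doc]:
    # Basic RAG: rank the whole corpus by naive token overlap with the query.
    q = set(query.lower().split())
    return sorted(CORPUS, key=lambda d: -len(q & set(d.text.lower().split())))[:k]

def retrieve_api_level(api: str, k: int = 2) -> list[Doc]:
    # API-level RAG: only consider documents tied to the target API.
    return [d for d in CORPUS if d.api == api][:k]

def build_prompt(api: str, docs: list[Doc]) -> str:
    context = "\n\n".join(f"[{d.source}] {d.text}" for d in docs)
    header = f"Context:\n{context}\n\n" if docs else ""
    return header + f"Write a pytest unit test that exercises `{api}`."

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in GPT-4o, Mistral MoE 8x22B, etc.")

# zero-shot:      build_prompt("torch.nn.Conv2d", [])
# Basic RAG:      build_prompt("torch.nn.Conv2d", retrieve_basic("test torch.nn.Conv2d"))
# API-level RAG:  build_prompt("torch.nn.Conv2d", retrieve_api_level("torch.nn.Conv2d"))
```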
Related papers
- Reinforcement Learning-Based REST API Testing with Multi-Coverage [4.127886193201882]
MUCOREST is a novel Reinforcement Learning (RL)-based API testing approach that leverages Q-learning to maximize code coverage and output coverage.
MUCOREST significantly outperforms state-of-the-art API testing approaches by 11.6-261.1% in the number of discovered API bugs.
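As a rough illustration of coverage-guided reinforcement learning for API testing, here is a toy tabular Q-learning loop; the state/action encoding, the fake `execute` harness, and the reward (count of newly covered targets) are assumptions for demonstration, not MUCOREST's actual formulation:
```python
# Toy tabular Q-learning loop for coverage-guided API testing.
import random
from collections import defaultdict

OPERATIONS = ["GET /users", "POST /users", "GET /users/{id}", "DELETE /users/{id}"]
Q = defaultdict(float)          # Q[(state, action)] -> expected future coverage gain
covered = set()                 # coverage targets hit so far
alpha, gamma, eps = 0.5, 0.9, 0.2

def execute(op):
    # Placeholder: a real harness would call the API and report coverage.
    return {f"{op}:branch{random.randint(0, 3)}"}

state = "start"
for _ in range(200):
    # epsilon-greedy choice of the next API operation to exercise
    if random.random() < eps:
        op = random.choice(OPERATIONS)
    else:
        op = max(OPERATIONS, key=lambda a: Q[(state, a)])
    newly_covered = execute(op) - covered
    reward = len(newly_covered)             # reward only for *new* coverage
    covered |= newly_covered
    next_state = op                         # next state: last executed operation
    best_next = max(Q[(next_state, a)] for a in OPERATIONS)
    Q[(state, op)] += alpha * (reward + gamma * best_next - Q[(state, op)])
    state = next_state
print(f"{len(covered)} coverage targets hit")
```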
arXiv Detail & Related papers (2024-10-20T14:20:23Z)
- APITestGenie: Automated API Test Generation through Generative AI [2.0716352593701277]
APITestGenie generates executable API test scripts from business requirements and API specifications.
In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time.
Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines.
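A minimal sketch of what such a validation gate could look like before CI/CD integration; the two checks below (parse, then run under pytest) are illustrative assumptions, not APITestGenie's own pipeline:
```python
# Sketch of a pre-merge gate for LLM-generated test scripts.
import ast
import subprocess
import sys

def parses(script_text: str) -> bool:
    # Reject generated scripts that are not even syntactically valid Python.
    try:
        ast.parse(script_text)
        return True
    except SyntaxError:
        return False

def passes_locally(script_path: str) -> bool:
    # Run the generated tests in a subprocess (requires pytest installed);
    # a non-zero exit code means the script is invalid or failing.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", script_path, "-q"],
        capture_output=True, timeout=120,
    )
    return proc.returncode == 0

# A generated script enters the suite only if it parses AND passes locally;
# everything else is queued for human review instead of going into CI/CD.
```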
arXiv Detail & Related papers (2024-09-05T18:02:41Z)
- RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention.
Current research addresses the limits of their built-in knowledge by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG).
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
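The core idea of such a framework (swappable retriever and generator components behind fixed interfaces) can be sketched as below; the interface names are assumptions, not RAGLAB's actual API:
```python
# Sketch of the modular retriever/generator decomposition that a framework
# like RAGLAB standardizes so algorithms can be compared fairly.
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

def rag_answer(query: str, retriever: Retriever, generator: Generator, k: int = 3) -> str:
    # Every RAG variant shares this skeleton: retrieve, assemble, generate.
    passages = retriever.retrieve(query, k)
    prompt = "\n".join(passages) + f"\n\nQuestion: {query}\nAnswer:"
    return generator.generate(prompt)

# Swapping in different Retriever/Generator implementations lets two RAG
# algorithms be compared under otherwise identical conditions.
```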
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification.
Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
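To make the rule-extraction idea concrete, here is a toy stand-in where a regex plays the role of the LLM in turning a parameter description into a machine-interpretable constraint plus boundary test values; this is illustrative only, not RESTGPT's implementation:
```python
# Toy stand-in for rule extraction from a natural-language parameter description.
import re

def extract_rule(description: str):
    # e.g. "page: must be an integer between 1 and 100"
    m = re.search(r"integer between (\d+) and (\d+)", description)
    if m is None:
        return None
    lo, hi = int(m.group(1)), int(m.group(2))
    return {"type": "integer", "minimum": lo, "maximum": hi}

rule = extract_rule("page: must be an integer between 1 and 100")
# Boundary and midpoint values make natural test inputs for the parameter.
example_values = [rule["minimum"], (rule["minimum"] + rule["maximum"]) // 2, rule["maximum"]]
print(rule, example_values)
```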
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
- A Simple Baseline for Knowledge-Based Visual Question Answering [78.00758742784532]
This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA).
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
arXiv Detail & Related papers (2023-10-20T15:08:17Z)
- Automatic Unit Test Generation for Deep Learning Frameworks based on API Knowledge [11.523398693942413]
We propose MUTester to generate unit test cases for APIs of deep learning frameworks.
We first propose a set of 18 rules for mining API constraints from the API documents.
We then use the frequent itemset mining technique to mine the API usage patterns from a large corpus of machine learning API related code fragments.
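A toy frequent-itemset pass over API call sets illustrates the usage-pattern mining step; the corpus and support threshold below are invented for demonstration and are not MUTester's actual mining setup:
```python
# Toy frequent-itemset pass over API call sets.
from collections import Counter
from itertools import combinations

# Each code fragment is reduced to the set of APIs it calls.
fragments = [
    {"torch.nn.Conv2d", "torch.relu", "torch.nn.MaxPool2d"},
    {"torch.nn.Conv2d", "torch.relu"},
    {"torch.nn.Conv2d", "torch.nn.MaxPool2d"},
]
min_support = 2

pair_counts = Counter(
    pair for frag in fragments for pair in combinations(sorted(frag), 2)
)
patterns = [pair for pair, count in pair_counts.items() if count >= min_support]
print(patterns)  # API pairs that co-occur often enough to seed a unit test
```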
arXiv Detail & Related papers (2023-07-01T18:34:56Z)
- Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge [155.81786738036578]
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs.
Pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources.
We propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge.
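A minimal generate-then-select skeleton conveys the pipeline shape; both stages below are placeholder heuristics, not RASO's actual generator or selector (which also conditions on the image):
```python
# Minimal generate-then-select skeleton.
def generate_candidates(question: str) -> list[str]:
    # Placeholder for sampling short candidate answers from a PLM.
    return ["a dog", "a cat", "a red fox"]

def select(question: str, candidates: list[str]) -> str:
    # Placeholder selector: prefer the candidate with the most word
    # overlap with the question.
    q = set(question.lower().split())
    return max(candidates, key=lambda c: len(q & set(c.lower().split())))

question = "What animal is sitting next to the dog?"
print(select(question, generate_candidates(question)))
```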
arXiv Detail & Related papers (2023-05-30T08:34:13Z)
- GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information [18.551792817140473]
We present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information.
We prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm.
GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83.
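The augmented decoding idea (pause generation when the model emits an API call, execute it, and resume with the result in context) can be sketched as follows; the `[CALL]` marker syntax and the `model_step` stub are assumptions, not GeneGPT's exact scheme:
```python
# Sketch of the detect-call-resume decoding loop behind this style of tool use.
import re
import urllib.request

def model_step(context: str) -> str:
    raise NotImplementedError("one decoding round of the LLM goes here")

def answer(question: str, max_rounds: int = 4) -> str:
    text = question
    for _ in range(max_rounds):
        chunk = model_step(text)
        text += chunk
        call = re.search(r"\[CALL\](https?://\S+?)\[/CALL\]", chunk)
        if call is None:
            return text             # no pending API call: answer is complete
        # e.g. an NCBI E-utilities URL emitted by the model
        with urllib.request.urlopen(call.group(1)) as resp:
            text += "\n[RESULT]" + resp.read().decode()[:500] + "[/RESULT]\n"
    return text
```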
arXiv Detail & Related papers (2023-04-19T13:53:19Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
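That baseline (one shared dense index over Wikipedia plus a seq2seq reader) can be sketched as follows; the hash-and-project encoder is a placeholder for a trained bi-encoder:
```python
# Minimal dense-retrieval baseline: embed every passage once into one shared
# index, retrieve nearest neighbours, and hand the hits to a seq2seq reader.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_DIM, EMB_DIM = 512, 128
projection = rng.normal(size=(VOCAB_DIM, EMB_DIM))

def embed(texts):
    # Placeholder encoder: hash tokens into a bag-of-words vector, project,
    # and L2-normalize so the dot product equals cosine similarity.
    vecs = np.zeros((len(texts), VOCAB_DIM))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % VOCAB_DIM] += 1.0
    out = vecs @ projection
    return out / np.linalg.norm(out, axis=1, keepdims=True)

passages = ["KILT grounds all tasks in one Wikipedia snapshot.",
            "Dense retrieval uses learned vector similarity."]
index = embed(passages)                  # one shared index for every task

def retrieve(query, k=1):
    scores = index @ embed([query])[0]   # cosine similarity over unit vectors
    return [passages[i] for i in np.argsort(-scores)[:k]]

print(retrieve("what grounds KILT tasks?"))
# The retrieved passages plus the query would feed a seq2seq reader (e.g. BART).
```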
arXiv Detail & Related papers (2020-09-04T15:32:19Z)