Retrieval-Augmented Test Generation: How Far Are We?
- URL: http://arxiv.org/abs/2409.12682v2
- Date: Thu, 16 Oct 2025 18:18:55 GMT
- Title: Retrieval-Augmented Test Generation: How Far Are We?
- Authors: Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, Song Wang
- Abstract summary: We investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs. We examine three domain-specific sources for RAG: API documentation (official guidelines), GitHub issues (developer-reported resolutions), and StackOverflow Q&As. Our study focuses on five widely used Python-based ML/DL libraries: TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost.
- Score: 10.473792371852015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on its effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries: TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting their most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactic and dynamic correctness and line coverage. While RAG does not enhance correctness, it improves line coverage by 6.5% on average. We found that GitHub issues yield the largest improvement in line coverage by providing edge cases from various issues. We also found that the generated unit tests can help detect new bugs: 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential to improve test coverage in unit test generation when paired with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to further optimize RAG-based unit test generation.
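To make the Basic RAG strategy above concrete, here is a minimal sketch of retrieval-augmented unit test generation. The keyword-overlap retriever, the toy corpus, and the stubbed `call_llm` function are hypothetical simplifications for illustration, not the paper's implementation:

```python
# Minimal Basic-RAG sketch for unit test generation.
# The retriever, corpus, and LLM stub are illustrative assumptions.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    """Rank documents by naive token overlap with the query."""
    q = tokenize(query)
    return sorted(corpus, key=lambda d: len(q & tokenize(d)), reverse=True)[:k]

def build_prompt(api_name, docs):
    context = "\n\n".join(docs)
    return (f"Using the context below, write a pytest unit test for `{api_name}`.\n"
            f"Cover the edge cases the context mentions.\n\nContext:\n{context}\n")

def call_llm(prompt):
    # Placeholder: swap in a real model (e.g., GPT-4o or Llama 3.1) here.
    return "def test_placeholder():\n    assert True"

# Toy knowledge source standing in for API docs, GitHub issues, or StackOverflow.
corpus = [
    "torch.nn.Linear applies a linear transformation; raises on shape mismatch.",
    "Issue: torch.nn.Linear misbehaves when in_features is zero.",
]

docs = retrieve("torch.nn.Linear edge cases", corpus)
print(call_llm(build_prompt("torch.nn.Linear", docs)))
```

API-level RAG would presumably differ mainly in scoping retrieval to documents about the specific API under test rather than the whole knowledge source.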
Related papers
- Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation [20.31612139450269]
Testing pull requests (PRs) is critical to maintaining software quality. Some PR-modified lines remain untested, leaving a "last-mile" regression test gap. We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
arXiv Detail & Related papers (2026-01-16T02:08:16Z)
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs [20.03968975178177]
Fuzz testing (fuzzing) is a simple yet effective way to find bugs in Deep Learning (DL) frameworks. We propose FUEL to effectively utilize feedback information; it comprises two Large Language Models (LLMs): an analysis LLM and a generation LLM. We show that FUEL can improve the line code coverage of PyTorch and TensorFlow by 9.15% and 14.70% over state-of-the-art baselines.
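A rough sketch of what such a two-LLM feedback loop could look like, with both model calls stubbed out (the function names and loop structure are illustrative assumptions, not FUEL's actual interfaces):

```python
# Illustrative feedback-driven fuzzing loop with an analysis LLM and a
# generation LLM; both are stubbed, and names are assumptions, not FUEL's API.
import subprocess, sys, tempfile

def generation_llm(guidance):
    # Placeholder: a real generation LLM would emit a fresh DL test program.
    return "import torch\nprint(torch.ones(2) + torch.ones(2))\n"

def analysis_llm(stdout, stderr):
    # Placeholder: a real analysis LLM would distill run results into guidance.
    return "a failure occurred; mutate around it" if stderr else "explore new API combinations"

guidance = "start with simple tensor ops"
for _ in range(3):  # a real fuzzer iterates far longer and tracks coverage
    program = generation_llm(guidance)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    guidance = analysis_llm(result.stdout, result.stderr)
```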
arXiv Detail & Related papers (2025-06-21T08:51:53Z)
- GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models [75.25348392263676]
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP). We propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation.
arXiv Detail & Related papers (2025-05-26T08:18:33Z)
- LRASGen: LLM-based RESTful API Specification Generation [3.420331911153286]
We propose a novel approach for generating OpenAPI Specifications (OASs) for APIs using Large Language Models (LLMs). Compared with existing tools and methods, LRASGen can generate OASs even when the implementation is incomplete (with partial code, annotations/comments, etc.). LRASGen-generated specifications cover an average of 48.85% more missed entities than developer-provided specifications.
arXiv Detail & Related papers (2025-04-23T15:52:50Z)
- Test Amplification for REST APIs via Single and Multi-Agent LLM Systems [1.6499388997661122]
We show how single-agent and multi-agent LLM systems can amplify a REST API test suite.
Our evaluation demonstrates increased API coverage, identification of numerous bugs in the API under test, and insights into the computational cost and energy consumption of both approaches.
arXiv Detail & Related papers (2025-04-10T20:19:50Z)
- Issue2Test: Generating Reproducing Test Cases from Issue Reports [17.854783249394913]
A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 32.9% of the issues.
arXiv Detail & Related papers (2025-03-20T16:44:00Z)
- When LLMs Meet API Documentation: Can Retrieval Augmentation Aid Code Generation Just as It Helps Developers? [10.204379646375182]
Retrieval-augmented generation (RAG) has increasingly shown its power in extending large language models' (LLMs') capability beyond their pre-trained knowledge.
We study the factors that affect the effectiveness of using the documentation of less common API libraries as additional knowledge for retrieval and generation.
arXiv Detail & Related papers (2025-03-19T14:08:47Z)
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs). We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
- LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
- ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solutions.
We show that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving an absolute increase of 11.24% over naive RAG approaches and 14.07% over pretraining methods in pass@10.
arXiv Detail & Related papers (2024-12-06T19:00:15Z)
- Reinforcement Learning-Based REST API Testing with Multi-Coverage [4.127886193201882]
MUCOREST is a novel Reinforcement Learning (RL)-based API testing approach that leverages Q-learning to maximize code coverage and output coverage.
MUCOREST significantly outperforms state-of-the-art API testing approaches by 11.6-261.1% in the number of discovered API bugs.
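For intuition, here is the core tabular Q-learning update such an approach might use, with coverage gain as the reward; the state/action encoding below is a toy assumption, not MUCOREST's actual design:

```python
# Toy tabular Q-learning step with coverage gain as the reward signal.
# The REST operations and states are invented for illustration.
import random
from collections import defaultdict

Q = defaultdict(float)               # Q[(state, action)] -> expected coverage gain
alpha, gamma, epsilon = 0.1, 0.9, 0.2
actions = ["GET /users", "POST /users", "DELETE /users/1"]

def choose(state):
    if random.random() < epsilon:                       # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One illustrative step: POST /users exercised three new branches.
update("fresh_session", "POST /users", reward=3.0, next_state="user_created")
print(choose("user_created"))
```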
arXiv Detail & Related papers (2024-10-20T14:20:23Z)
- APITestGenie: Automated API Test Generation through Generative AI [2.0716352593701277]
APITestGenie generates executable API test scripts from business requirements and API specifications.
In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time.
Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines.
arXiv Detail & Related papers (2024-09-05T18:02:41Z)
- RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention.
Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval-Augmented Generation (RAG).
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification.
Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
- A Simple Baseline for Knowledge-Based Visual Question Answering [78.00758742784532]
This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA).
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
arXiv Detail & Related papers (2023-10-20T15:08:17Z)
- Automatic Unit Test Generation for Deep Learning Frameworks based on API Knowledge [11.523398693942413]
We propose MUTester to generate unit test cases for APIs of deep learning frameworks.
We first propose a set of 18 rules for mining API constraints from API documents.
We then use the frequent itemset mining technique to mine the API usage patterns from a large corpus of machine learning API related code fragments.
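As a toy illustration of that mining step, here is a brute-force frequent-itemset pass over invented API call sets; the data and support threshold are assumptions for the example, not MUTester's corpus:

```python
# Brute-force frequent itemset mining over toy API usage records.
# The usage records and min_support value are invented for illustration.
from collections import Counter
from itertools import combinations

usages = [
    {"torch.tensor", "torch.nn.Linear", "model.forward"},
    {"torch.tensor", "torch.nn.Linear", "optimizer.step"},
    {"torch.tensor", "model.forward"},
]
min_support = 2  # an itemset must appear in at least two usage records

counts = Counter()
for usage in usages:
    for size in (1, 2):
        for itemset in combinations(sorted(usage), size):
            counts[itemset] += 1

patterns = {s: c for s, c in counts.items() if c >= min_support}
print(patterns)  # frequently co-occurring API calls suggest usage patterns
```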
arXiv Detail & Related papers (2023-07-01T18:34:56Z)
- Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge [155.81786738036578]
Open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs.
Pre-trained Language Models (PLMs) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources.
We propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge.
arXiv Detail & Related papers (2023-05-30T08:34:13Z)
- GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information [18.551792817140473]
We present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI).
We prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm.
GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83.
arXiv Detail & Related papers (2023-04-19T13:53:19Z)
- Fuzzing Deep Learning Compilers with HirGen [12.068825031724229]
HirGen is an automated testing technique that aims to effectively expose coding mistakes in the optimization of high-level IR.
HirGen has successfully detected 21 bugs in TVM, with 17 confirmed and 12 fixed.
Our experiment results show that HirGen can detect 10 crashes and inconsistencies that cannot be detected by the baselines in 48 hours.
arXiv Detail & Related papers (2022-08-03T16:26:30Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present KILT, a benchmark for knowledge-intensive language tasks.
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.