Subgraph-Oriented Testing for Deep Learning Libraries
- URL: http://arxiv.org/abs/2412.06430v1
- Date: Mon, 09 Dec 2024 12:10:48 GMT
- Title: Subgraph-Oriented Testing for Deep Learning Libraries
- Authors: Xiaoyuan Xie, Yan Song, Songqiang Chen, Jinfu Chen,
- Abstract summary: We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms.
SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects.
SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
- Score: 9.78188667672054
- Abstract: Deep Learning (DL) libraries, such as PyTorch, are widely used for building and deploying DL models on various hardware platforms. However, they have been found to contain bugs that lead to incorrect calculation results and cause issues such as non-convergent training and inaccurate predictions. Many efforts have therefore been made to test DL libraries and reveal bugs. Existing DL library testing methods nevertheless show limitations: model-level methods make fault localization difficult, while API-level methods often generate invalid inputs or focus primarily on extreme inputs that trigger crashes, and they ignore realistic API interactions. These limitations can leave bugs undetected, even in frequently used APIs. To address them, we propose SORT (Subgraph-Oriented Realistic Testing) to differentially test DL libraries across hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model computation graphs, as test subjects. In this way, it exercises realistic API interaction sequences while remaining efficient at locating the faulty API behind an observed error. In addition, SORT prepares test inputs by referring to extensive features of the runtime inputs each API receives when executing real-life benchmark data. The generated inputs are thus expected to better simulate valid real-world inputs and to reveal bugs that are more likely to occur in real-life usage. Evaluation on 728 frequent subgraphs of 49 popular PyTorch models shows that SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing. In total, 18 precision bugs in PyTorch are identified.
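The abstract describes SORT's core loop: take a frequent subgraph as the test subject, generate inputs that resemble the runtime inputs each API sees on benchmark data, and compare outputs across hardware backends. The snippet below is a minimal sketch of that idea in PyTorch, not SORT's implementation: the chosen subgraph (Conv2d -> BatchNorm2d -> ReLU), the input shape and value ranges, and the comparison tolerances are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def build_subgraph() -> nn.Module:
    # A frequently seen API interaction pattern in vision models (illustrative choice).
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    )


def sample_realistic_input() -> torch.Tensor:
    # Stand-in for feature-guided input generation: the shape and value ranges here
    # are assumed, not mined from real benchmark runs as SORT does.
    batch = int(torch.randint(1, 9, (1,)))
    side = int(torch.randint(16, 65, (1,)))
    return torch.empty(batch, 3, side, side).uniform_(-3.0, 3.0)


def differential_test(trials: int = 10, rtol: float = 1e-3, atol: float = 1e-4) -> None:
    # Differentially test the subgraph on CPU vs. CUDA with identical weights.
    if not torch.cuda.is_available():
        print("CUDA backend unavailable; skipping cross-device comparison.")
        return
    torch.manual_seed(0)
    cpu_graph = build_subgraph().eval()
    gpu_graph = build_subgraph().eval().cuda()
    gpu_graph.load_state_dict(cpu_graph.state_dict())  # same weights on both backends
    with torch.no_grad():
        for i in range(trials):
            x = sample_realistic_input()
            out_cpu = cpu_graph(x)
            out_gpu = gpu_graph(x.cuda()).cpu()
            if not torch.allclose(out_cpu, out_gpu, rtol=rtol, atol=atol):
                max_diff = (out_cpu - out_gpu).abs().max().item()
                print(f"trial {i}: cross-backend mismatch, max abs diff = {max_diff:.3e}")


if __name__ == "__main__":
    differential_test()
```

Running both copies in eval mode with identical weights keeps BatchNorm2d deterministic, so any difference beyond the tolerance points at the backend kernels rather than at the harness itself.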
Related papers
- LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs.
LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median power of 77.4% against a range of distortions; a minimal sketch of such a kernel two-sample test appears after this list.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
- Reinforcement Learning-Based REST API Testing with Multi-Coverage [4.127886193201882]
MUCOREST is a novel Reinforcement Learning (RL)-based API testing approach that leverages Q-learning to maximize code coverage and output coverage.
MUCOREST significantly outperforms state-of-the-art API testing approaches by 11.6-261.1% in the number of discovered API bugs.
arXiv Detail & Related papers (2024-10-20T14:20:23Z)
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers are present.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z)
- KAT: Dependency-aware Automated API Testing with Large Language Models [1.7264233311359707]
KAT (Katalon API Testing) is a novel AI-driven approach that autonomously generates test cases to validate APIs.
Our evaluation of KAT using 12 real-world services shows that it can improve validation coverage, detect more undocumented status codes, and reduce false positives in these services.
arXiv Detail & Related papers (2024-07-14T14:48:18Z)
- CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools have limited coverage of bug types.
We propose Citadel, a method that finds bugs more efficiently and effectively.
arXiv Detail & Related papers (2024-06-18T01:51:16Z)
- DLLens: Testing Deep Learning Libraries via LLM-aided Synthesis [8.779035160734523]
Testing is a major approach to ensuring the quality of deep learning (DL) libraries.
Existing testing techniques commonly adopt differential testing to relieve the need for test oracle construction.
This paper introduces DLLens, a novel differential testing technique for DL library testing.
arXiv Detail & Related papers (2024-06-12T07:06:38Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
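The Model Equality Testing entry above frames endpoint auditing as a two-sample testing problem over model outputs with a simple string kernel. The sketch below illustrates one way such a test could be built; the character n-gram kernel, the MMD statistic, the permutation p-value, and the helper names are assumptions for illustration, not the construction used in that paper.

```python
import random
from collections import Counter
from itertools import combinations


def ngram_kernel(a: str, b: str, n: int = 3) -> float:
    # Cosine similarity between character n-gram count vectors (hypothetical kernel choice).
    ca = Counter(a[i:i + n] for i in range(max(len(a) - n + 1, 1)))
    cb = Counter(b[i:i + n] for i in range(max(len(b) - n + 1, 1)))
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = (sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())) ** 0.5
    return dot / norm if norm else 0.0


def mmd2(xs: list[str], ys: list[str]) -> float:
    # MMD^2 estimate: mean within-sample similarity minus cross-sample similarity.
    def mean_within(zs: list[str]) -> float:
        pairs = list(combinations(zs, 2))
        return sum(ngram_kernel(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

    k_xy = sum(ngram_kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return mean_within(xs) + mean_within(ys) - 2.0 * k_xy


def permutation_p_value(xs: list[str], ys: list[str], n_perm: int = 200, seed: int = 0) -> float:
    # p-value for H0: both samples come from the same output distribution.
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy usage: compare completions from a reference model against an endpoint's completions.
    reference = ["The cat sat on the mat.", "Paris is the capital of France."] * 10
    endpoint = ["the cat sat on the mat", "paris is the capital of france"] * 10
    print(f"p-value: {permutation_p_value(reference, endpoint):.3f}")
```

A small p-value suggests the endpoint's completions and the reference model's completions are drawn from different distributions.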