A Tale of Two DL Cities: When Library Tests Meet Compiler
- URL: http://arxiv.org/abs/2407.16626v2
- Date: Wed, 14 Aug 2024 14:25:37 GMT
- Title: A Tale of Two DL Cities: When Library Tests Meet Compiler
- Authors: Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang
- Abstract summary: We propose OPERA, which extracts domain knowledge from the test inputs of DL libraries.
OPERA constructs diverse compiler tests from these library test inputs.
It incorporates a diversity-based test prioritization strategy so that the migrated test inputs that are more likely to detect diverse bugs are executed earlier.
- Score: 12.751626834965231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning (DL) compilers typically load a DL model and optimize it via an intermediate representation. Existing DL compiler testing techniques mainly focus on the model optimization stages, but rarely explore bug detection at the model loading stage. Effectively testing the model loading stage requires covering diverse usages of each DL operator from various DL libraries, which shares a common objective with DL library testing, indicating that the knowledge embedded in DL library tests is beneficial for testing the model loading stage of DL compilers. In this work, we propose OPERA to extract such domain knowledge from the test inputs for DL libraries. OPERA constructs diverse tests from the various test inputs for DL libraries (including the test inputs documented in DL libraries and those generated by recent fuzzers). In addition, it incorporates a diversity-based test prioritization strategy to migrate and execute those test inputs that are more likely to detect diverse bugs earlier. We considered three sources of tests in DL libraries for migration and used eight frontends from three DL compilers (i.e., TVM, TensorRT, and OpenVINO) for evaluation. OPERA detected 170 previously unknown bugs in total, 90 of which have been confirmed or fixed by developers, demonstrating the effectiveness of the migration-based idea. The test prioritization strategy in OPERA improves testing efficiency with migrated tests by 11.9%~47.4% on average compared to general test prioritization strategies.
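To make the migration idea concrete, the following is a minimal, hypothetical sketch (not OPERA's implementation): a test input borrowed from a DL library's operator test suite is wrapped in a tiny model and pushed through a compiler frontend's model-loading stage, with the library's own output serving as a differential oracle. The operator choice, shapes, and tolerances here are illustrative assumptions; only the standard PyTorch and TVM APIs shown are real.

```python
# Minimal sketch of migrating a DL library test input to a DL compiler
# frontend (illustrative; not OPERA's implementation). Assumes PyTorch and
# TVM are installed; operator, shapes, and tolerances are arbitrary choices.
import numpy as np
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# A library-style test input for torch.nn.functional.pad, wrapped as a model.
class PadCase(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.pad(x, (1, 2, 0, 1), mode="constant", value=0.0)

x = torch.randn(1, 3, 8, 8)
scripted = torch.jit.trace(PadCase().eval(), x)

# Exercise the model-loading stage of the TVM PyTorch frontend.
mod, params = relay.frontend.from_pytorch(scripted, [("x", list(x.shape))])

# Compile and run, then use the library's result as a differential oracle.
lib = relay.build(mod, target="llvm", params=params)
runtime = graph_executor.GraphModule(lib["default"](tvm.cpu()))
runtime.set_input("x", tvm.nd.array(x.numpy()))
runtime.run()
compiler_out = runtime.get_output(0).numpy()
library_out = scripted(x).detach().numpy()
np.testing.assert_allclose(compiler_out, library_out, rtol=1e-4, atol=1e-4)
```

OPERA additionally prioritizes such migrated tests with a diversity-based strategy before executing them, which this sketch omits.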
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - LLMBox: A Comprehensive Library for Large Language Models [109.15654830320553]
This paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of large language models (LLMs).
The library features three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical considerations, especially regarding user-friendliness and efficiency.
arXiv Detail & Related papers (2024-07-08T02:39:33Z) - DLLens: Testing Deep Learning Libraries via LLM-aided Synthesis [8.779035160734523]
Testing is a major approach to ensuring the quality of deep learning (DL) libraries.
Existing testing techniques commonly adopt differential testing to relieve the need for test oracle construction.
This paper introduces DLLens, a novel differential testing technique for DL library testing.
arXiv Detail & Related papers (2024-06-12T07:06:38Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - A Survey of Deep Learning Library Testing Methods [33.62859142913532]
Deep learning (DL) libraries undertake the underlying optimization and computation.
DL libraries are not immune to bugs, which can pose serious threats to users' personal property and safety.
This paper provides an overview of the testing research related to various DL libraries.
arXiv Detail & Related papers (2024-04-27T11:42:13Z) - Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects [24.712437703214547]
Deep Learning (DL) models have rapidly advanced, with development focusing on achieving high performance by testing model accuracy and robustness.
However, it is unclear whether DL projects, which need to be treated and tested like other software systems, are tested thoroughly or are functionally correct.
We empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub.
arXiv Detail & Related papers (2024-02-26T13:08:44Z) - Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [64.62570402941387]
We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain.
Our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe.
arXiv Detail & Related papers (2023-11-02T17:59:32Z) - ACETest: Automated Constraint Extraction for Testing Deep Learning Operators [23.129431525952263]
For testing DL operators, it is essential that test cases pass the input validity check and are able to reach the core function logic of the operators.
Existing techniques rely on either human effort or documentation of DL library APIs to extract the constraints.
We propose ACETest, a technique to automatically extract input validation constraints from the code to build valid yet diverse test cases.
arXiv Detail & Related papers (2023-05-29T06:49:40Z) - Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark] [65.11858854040544]
We perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets.
First, we assess their vectorization overhead for converting all input entities into dense embedding vectors.
Second, we investigate their blocking performance, perform a detailed scalability analysis, and compare them with the state-of-the-art deep learning-based blocking method.
Third, we conclude with their relative performance for both supervised and unsupervised matching.
arXiv Detail & Related papers (2023-04-24T08:53:54Z) - MEMO: Coverage-guided Model Generation For Deep Learning Library Testing [11.263121366956726]
A few techniques have been proposed to test deep learning (DL) libraries by generating DL models as test inputs.
However, the test effectiveness of these techniques is constrained by the diversity of the generated DL models.
We propose MEMO to efficiently generate diverse DL models by exploring layer types, layer pairs, and layer parameters.
arXiv Detail & Related papers (2022-08-02T14:53:02Z) - The MultiBERTs: BERT Reproductions for Robustness Analysis [86.29162676103385]
Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
arXiv Detail & Related papers (2021-06-30T15:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.