Fuzzing Deep Learning Compilers with HirGen
- URL: http://arxiv.org/abs/2208.02193v5
- Date: Wed, 21 Jun 2023 06:19:33 GMT
- Title: Fuzzing Deep Learning Compilers with HirGen
- Authors: Haoyang Ma, Qingchao Shen, Yongqiang Tian, Junjie Chen, Shing-Chi
Cheung
- Abstract summary: HirGen is an automated testing technique that aims to effectively expose coding mistakes in the optimization of high-level IR.
HirGen has successfully detected 21 bugs in TVM, with 17 confirmed and 12 fixed.
Our experiments show that, within 48 hours, HirGen detects 10 crashes and inconsistencies that none of the baselines can.
- Score: 12.068825031724229
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep Learning (DL) compilers are widely adopted to optimize advanced DL
models for efficient deployment on diverse hardware. Their quality has a profound
effect on the quality of the compiled DL models. A recent bug study shows that the
optimization of high-level intermediate representation (IR) is the most
error-prone compilation stage; bugs at this stage account for 44.92% of all
collected bugs. However, existing testing techniques do not consider features
related to high-level optimization (e.g., high-level IR) and are therefore weak
at exposing bugs at this stage. To bridge this gap, we propose HirGen, an
automated testing technique that aims to effectively expose coding mistakes in
the optimization of high-level IR. The design of HirGen includes 1) three
coverage criteria for generating diverse and valid computational graphs; 2)
full use of the high-level IR's language features to generate diverse IRs; and
3) three test oracles inspired by differential testing and metamorphic testing.
HirGen has successfully detected 21 bugs in TVM, with 17 confirmed and 12
fixed. Further, we construct four baselines using state-of-the-art DL compiler
fuzzers that can cover the high-level optimization stage. Our experimental
results show that, within 48 hours, HirGen detects 10 crashes and
inconsistencies that none of the baselines can detect. We further validate the
usefulness of our proposed coverage criteria and test oracles in our
evaluation.
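To make the oracle idea concrete, below is a minimal, hypothetical sketch of a differential-testing oracle over TVM's high-level IR (Relay): one computational graph is compiled at several optimization levels, and disagreeing outputs point to a high-level optimization bug. The graph builder and helper names are ours, not HirGen's, and the snippet assumes TVM's Python API (roughly v0.8 or later).

```python
# Hypothetical sketch (not HirGen's actual code): a differential-testing
# oracle over TVM's high-level IR (Relay). Compile the same graph at
# several optimization levels and compare outputs.
import numpy as np
import tvm
from tvm import relay

def build_toy_graph():
    # A tiny, valid computational graph; HirGen generates far more
    # diverse ones, guided by its three coverage criteria.
    x = relay.var("x", shape=(4, 4), dtype="float32")
    y = relay.var("y", shape=(4, 4), dtype="float32")
    out = relay.nn.relu(relay.add(x, y))
    return tvm.IRModule.from_expr(relay.Function([x, y], out))

def run(mod, inputs, opt_level):
    # Compile and execute the module at the given optimization level.
    with tvm.transform.PassContext(opt_level=opt_level):
        fn = relay.create_executor("graph", mod=mod, target="llvm").evaluate()
    return fn(*inputs).numpy()

mod = build_toy_graph()
inputs = [tvm.nd.array(np.random.rand(4, 4).astype("float32")) for _ in range(2)]
reference = run(mod, inputs, opt_level=0)  # unoptimized reference run
for level in (1, 2, 3, 4):
    if not np.allclose(reference, run(mod, inputs, opt_level=level), rtol=1e-4):
        print(f"possible high-level optimization bug at opt_level={level}")
```

HirGen's metamorphic oracles work in the same spirit, except the comparison is between outputs of semantically equivalent variants of a graph rather than between optimization levels.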
Related papers
- Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing [31.165102332393964]
Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance.
It is crucial to conduct robustness testing on LAPR techniques before their practical deployment.
We propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques.
arXiv Detail & Related papers (2024-10-10T01:14:58Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs in incorrect code, with three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces ADer, a comprehensive visual anomaly detection benchmark built as a modular framework that is easily extensible to new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z) - Constrained C-Test Generation via Mixed-Integer Programming [55.28927994487036]
This work proposes a novel method to generate C-Tests, a form of cloze test (a gap-filling exercise) in which only the last part of a word is turned into a gap.
In contrast to previous works, which only vary the gap size or gap placement and reach locally optimal solutions, we propose a mixed-integer programming (MIP) approach.
We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
arXiv Detail & Related papers (2024-04-12T21:35:21Z) - Evolutionary Generative Fuzzing for Differential Testing of the Kotlin
Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z) - HDCC: A Hyperdimensional Computing compiler for classification on
embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z) - Finding Deep-Learning Compilation Bugs with NNSmith [20.082492391396933]
We propose a new fuzz testing approach for finding bugs in deep-learning compilers.
Our core approach uses (i) light-weight operator specifications to generate diverse yet valid models, (ii) a gradient-based search process, and (iii) differential testing to identify bugs.
We implemented this approach in NNSmith, which has found 65 new bugs over the last seven months in TVM, TensorRT, ONNXRuntime, and PyTorch. Of these, 52 have been confirmed and 44 fixed by maintainers.
arXiv Detail & Related papers (2022-07-26T17:39:51Z) - Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation [20.519361342905775]
We propose Tzer, a practical fuzzing technique for the widely used TVM tensor compiler.
Our results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing.
To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed.
arXiv Detail & Related papers (2022-02-21T01:48:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.