Related papers: Testing Deep Learning Libraries via Neurosymbolic Constraint Learning

URL: http://arxiv.org/abs/2601.15493v1
Date: Wed, 21 Jan 2026 21:54:41 GMT
Title: Testing Deep Learning Libraries via Neurosymbolic Constraint Learning
Authors: M M Abid Naziri, Shinhae Kim, Feiran Qin, Marcelo d'Amorim, Saikat Dutta
Abstract summary: Deep Learning (DL) libraries (e.g., PyTorch) are popular in AI development. A key challenge in testing DL libraries is the lack of API specifications. We develop Centaur -- the first neurosymbolic technique to test DL library APIs using dynamically learned input constraints.
Score: 3.491101173753068
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep Learning (DL) libraries (e.g., PyTorch) are popular in AI development. These libraries are complex and contain bugs. Researchers have proposed various bug-finding techniques for such libraries. Yet, there is much room for improvement. A key challenge in testing DL libraries is the lack of API specifications. Prior testing approaches often inaccurately model the input specifications of DL APIs, resulting in missed valid inputs that could reveal bugs or false alarms due to invalid inputs. To address this challenge, we develop Centaur -- the first neurosymbolic technique to test DL library APIs using dynamically learned input constraints. Centaur leverages the key idea that formal API constraints can be learned from a small number of automatically generated seed inputs, and that the learned constraints can be solved using SMT solvers to generate valid and diverse test inputs. We develop a novel grammar that represents first-order logic formulae over API parameters and expresses tensor-related properties (e.g., shape, data types) as well as relational properties between parameters. We use the grammar to guide a Large Language Model (LLM) to enumerate syntactically correct candidate rules, validated using seed inputs. Further, we develop a custom refinement strategy to prune the set of learned rules to eliminate spurious or redundant rules. We use the learned constraints to systematically generate valid and diverse inputs by integrating SMT-based solving with randomized sampling. We evaluate Centaur for testing PyTorch and TensorFlow. Our results show that Centaur's constraints have a recall of 94.0% and a precision of 94.0% on average. In terms of coverage, Centaur covers 203, 150, and 9,608 more branches than TitanFuzz, ACETest and Pathfinder, respectively. Using Centaur, we also detect 26 new bugs in PyTorch and TensorFlow, 18 of which are confirmed.
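
The learn-then-solve idea in the abstract can be illustrated with a toy sketch. This is not Centaur's implementation: the seed inputs, the candidate rules (standing in for LLM-enumerated formulae), and the matmul-like API are all hypothetical, rule refinement is omitted, and a brute-force enumeration over small shapes replaces the SMT solver.

```python
import itertools
import random

# Hypothetical seed inputs for a matmul-like API that takes two 2-D shapes.
# Each seed is (a_shape, b_shape, is_valid); is_valid records whether the call succeeded.
SEEDS = [
    ((2, 3), (3, 4), True),
    ((1, 5), (5, 5), True),
    ((4, 2), (2, 1), True),
    ((2, 3), (4, 3), False),
    ((3, 3), (2, 3), False),
]

# Hypothetical candidate rules over the API parameters (assumed, not from the paper).
CANDIDATE_RULES = {
    "inner_dims_match": lambda a, b: a[1] == b[0],
    "both_rank_2": lambda a, b: len(a) == 2 and len(b) == 2,
    "a_rows_lt_b_cols": lambda a, b: a[0] < b[1],  # spurious: fails on a valid seed
}


def learn_rules(candidates, seeds):
    """Keep only the candidate rules that hold on every valid seed input."""
    valid = [(a, b) for a, b, ok in seeds if ok]
    return {
        name: rule
        for name, rule in candidates.items()
        if all(rule(a, b) for a, b in valid)
    }


def rejects_all_invalid(rules, seeds):
    """Check that each invalid seed violates at least one learned rule."""
    invalid = [(a, b) for a, b, ok in seeds if not ok]
    return all(
        any(not rule(a, b) for rule in rules.values()) for a, b in invalid
    )


def generate_inputs(rules, max_dim=4, count=5, seed=0):
    """Enumerate small shape pairs, keep those satisfying all rules, then sample.

    The enumeration is a stand-in for constraint solving; the random sample
    adds diversity among the satisfying assignments.
    """
    dims = range(1, max_dim + 1)
    shapes = list(itertools.product(dims, repeat=2))
    satisfying = [
        (a, b)
        for a, b in itertools.product(shapes, repeat=2)
        if all(rule(a, b) for rule in rules.values())
    ]
    rng = random.Random(seed)
    return rng.sample(satisfying, min(count, len(satisfying)))


learned = learn_rules(CANDIDATE_RULES, SEEDS)
tests = generate_inputs(learned)
print("learned rules:", sorted(learned))
for a_shape, b_shape in tests:
    print("test input:", a_shape, b_shape)
```

Every generated pair satisfies the learned rules by construction, so a failure on such an input would point at the library rather than at an invalid test.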

Related papers

Improving Deep Learning Library Testing with Machine Learning [40.21709249669499]
We explore using machine learning (ML) to determine input validity. Shape relationships are a precise abstraction to encode concrete inputs and capture of the data. We show that ML-enhanced input classification is an important aid to scale DL library testing.
arXiv Detail & Related papers (2026-02-03T17:19:01Z)
Constraint-Guided Unit Test Generation for Machine Learning Libraries [8.883254370291256]
Machine learning (ML) libraries such as PyTorch and TensorFlow are essential for a wide range of modern applications. Ensuring the correctness of ML libraries through testing is crucial. In this paper, we present PynguinML, an approach that improves the Pynguin test generator to leverage these constraints.
arXiv Detail & Related papers (2025-10-10T08:02:15Z)
Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like buffer overflows and use-after-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
LLM Based Input Space Partitioning Testing for Library APIs [13.070272424794744]
We present an LLM-based input space partitioning testing approach, LISP, for library API testing. We evaluate LISP on more than 2,205 library API methods taken from 10 popular open-source Java libraries. On average, LISP achieves 67.82% branch coverage, surpassing EvoSuite by 1.21 times.
arXiv Detail & Related papers (2024-12-15T17:50:50Z)
Subgraph-Oriented Testing for Deep Learning Libraries [9.78188667672054]
We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects. SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
arXiv Detail & Related papers (2024-12-09T12:10:48Z)
ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solution. Experimental results demonstrate that ExploraCoder significantly improves performance for models lacking prior API knowledge.
arXiv Detail & Related papers (2024-12-06T19:00:15Z)
Enhancing Differential Testing With LLMs For Testing Deep Learning Libraries [8.779035160734523]
This paper introduces an LLM-enhanced differential testing technique for DL libraries. It addresses the challenges of finding alternative implementations for a given API and generating diverse test inputs. It synthesizes counterparts for 1.84 times as many APIs as those found by state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-12T07:06:38Z)
Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs. However, there remain gaps between current studies and how language models are trained. In contrast, scaling laws mostly predict loss on inference, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification. Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions. Our findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information. Test-time adaptive (TTA) methods are proposed to address this issue. In this work, we adopt a Non-Parametric Classifier to perform test-time Adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
