ATTest: Agent-Driven Tensor Testing for Deep Learning Library Modules
- URL: http://arxiv.org/abs/2602.13987v1
- Date: Sun, 15 Feb 2026 04:47:58 GMT
- Title: ATTest: Agent-Driven Tensor Testing for Deep Learning Library Modules
- Authors: Zhengyu Zhan, Ye Shang, Jiawei Liu, Chunrong Fang, Quanjun Zhang, Zhenyu Chen
- Abstract summary: Unit testing of Deep Learning (DL) libraries is challenging due to complex numerical semantics and implicit tensor constraints. This paper proposes ATTest, an agent-driven testing framework for module-level unit test generation.
- Score: 19.355376741404267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The unit testing of Deep Learning (DL) libraries is challenging due to complex numerical semantics and implicit tensor constraints. Traditional Search-Based Software Testing (SBST) often suffers from semantic blindness, failing to satisfy the constraints of high-dimensional tensors, whereas Large Language Models (LLMs) struggle with cross-file context and unstable code modifications. This paper proposes ATTest, an agent-driven tensor testing framework for module-level unit test generation. ATTest orchestrates a seven-stage pipeline, which encompasses constraint extraction and an iterative "generation-validation-repair" loop, to maintain testing stability and mitigate context-window saturation. An evaluation on PyTorch and TensorFlow demonstrates that ATTest significantly outperforms state-of-the-art baselines such as PynguinML, achieving an average branch coverage of 55.60% and 54.77%, respectively. The results illustrate how agent-driven workflows bridge the semantic gap in numerical libraries while ensuring auditable test synthesis. Source code: https://github.com/iSEngLab/ATTest.git
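To make the pipeline's core idea concrete, below is a minimal Python sketch of an iterative "generation-validation-repair" loop for tensor unit tests. This is not ATTest's implementation: `generate_test` and `repair_test` are hypothetical stand-ins for the LLM-backed agents, and the toy usage assumes PyTorch is installed.

```python
# Illustrative sketch of a "generation-validation-repair" loop for tensor
# unit tests. Not ATTest's implementation: `generate_test` and `repair_test`
# are hypothetical stand-ins for LLM-backed agents.
import traceback

def generation_validation_repair(generate_test, repair_test, max_rounds=3):
    """Generate a candidate test, execute it, and repair it on failure."""
    test_code = generate_test()
    for _ in range(max_rounds):
        try:
            # Validation step: run the candidate test in a fresh namespace.
            exec(compile(test_code, "<candidate_test>", "exec"), {})
            return test_code  # executed without raising: accept the test
        except Exception:
            # Repair step: feed the failure trace back to the repair agent.
            test_code = repair_test(test_code, traceback.format_exc())
    return None  # give up after max_rounds to keep the pipeline stable

# Toy usage with stub "agents" targeting a PyTorch-style API
# (assumes PyTorch is installed).
if __name__ == "__main__":
    broken = "import torch\nassert torch.ones(2, 3).shape == (2, 4)\n"
    fixed = "import torch\nassert torch.ones(2, 3).shape == (2, 3)\n"
    accepted = generation_validation_repair(lambda: broken,
                                            lambda code, err: fixed)
    print("accepted test:\n", accepted)
```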
Related papers
- VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents [42.56489784841984]
"fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs.<n>We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact.
arXiv Detail & Related papers (2026-01-21T19:29:00Z) - Constraint-Guided Unit Test Generation for Machine Learning Libraries [8.883254370291256]
Machine learning (ML) libraries such as PyTorch and TensorFlow are essential for a wide range of modern applications. Ensuring the correctness of ML libraries through testing is crucial. In this paper, we present PynguinML, an approach that improves the Pynguin test generator to leverage tensor API constraints.
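To illustrate what constraint-guided input generation can look like in practice, here is a small NumPy sketch; the constraint fields and the sampler are hypothetical and are not PynguinML's API.

```python
# Hypothetical sketch of constraint-guided tensor input generation, in the
# spirit of constraint-aware test generators; field names are illustrative.
import numpy as np

# Example constraint for a tensor-valued parameter: a 2-D float32 tensor
# with non-negative entries (e.g., a valid input for a sqrt-like op).
constraint = {"ndim": 2, "dtype": np.float32, "min_value": 0.0, "max_dim": 8}

def sample_tensor(c, rng=None):
    """Draw a random tensor that satisfies the declared constraint."""
    rng = rng or np.random.default_rng(0)
    shape = tuple(int(d) for d in rng.integers(1, c["max_dim"] + 1, size=c["ndim"]))
    data = rng.random(shape) * 10.0 + c["min_value"]
    return data.astype(c["dtype"])

x = sample_tensor(constraint)
assert x.ndim == 2 and x.dtype == np.float32 and (x >= 0).all()
print("sampled shape:", x.shape)
```

Inputs drawn this way respect shape and dtype preconditions, which is where purely search-based generation tends to produce invalid calls.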
arXiv Detail & Related papers (2025-10-10T08:02:15Z) - Training-Free Time Series Classification via In-Context Reasoning with LLM Agents [29.14242392533328]
Time series classification (TSC) spans diverse application scenarios, yet labeled data are often scarce. We propose FETA, a multi-agent framework for training-free TSC via exemplar-based in-context reasoning.
arXiv Detail & Related papers (2025-10-07T14:07:43Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: a System Shell Layer, a Prompt Orchestration Layer, and an LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval(+), MBPP(+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z) - SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [53.54568352375669]
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests. We generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark.
arXiv Detail & Related papers (2025-06-10T17:23:33Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper addresses the setting in which inference must perform both sample recognition and outlier rejection when outliers are present in the test data.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z) - Fix the Tests: Augmenting LLMs to Repair Test Cases with Static Collector and Neural Reranker [9.428021853841296]
We propose SYNTER, a novel approach to automatically repair obsolete test cases via precise and concise TROCtxs construction.
Augmented with the constructed TROCtxs, hallucinations are reduced by 57.1%.
arXiv Detail & Related papers (2024-07-04T04:24:43Z) - Enhancing Differential Testing With LLMs For Testing Deep Learning Libraries [8.779035160734523]
This paper introduces an LLM-enhanced differential testing technique for DL libraries. It addresses the challenges of finding alternative implementations for a given API and generating diverse test inputs. It synthesizes counterparts for 1.84 times as many APIs as those found by state-of-the-art techniques.
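As background on the general technique, a minimal differential test compares a DL API against an independently written counterpart on shared inputs. The sketch below uses a hand-written NumPy reference for torch.log_softmax (the paper instead synthesizes counterparts with an LLM) and assumes PyTorch and NumPy are installed.

```python
# Minimal differential-testing sketch: compare torch.log_softmax against a
# hand-written NumPy reference on random inputs. This only illustrates the
# general technique; it is not the paper's LLM-based counterpart synthesis.
import numpy as np
import torch

def numpy_log_softmax(x, axis=-1):
    # Numerically stable reference implementation.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=(4, 7)).astype(np.float32)
    got = torch.log_softmax(torch.from_numpy(x), dim=-1).numpy()
    want = numpy_log_softmax(x, axis=-1)
    # A large discrepancy would flag a potential bug in one implementation.
    np.testing.assert_allclose(got, want, rtol=1e-4, atol=1e-5)
print("no divergence found on 100 random inputs")
```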
arXiv Detail & Related papers (2024-06-12T07:06:38Z) - Auditing AI models for Verified Deployment under Semantic Specifications [65.12401653917838]
AuditAI bridges the gap between interpretable formal verification and scalability.
We show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations.
arXiv Detail & Related papers (2021-09-25T22:53:24Z)