VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
- URL: http://arxiv.org/abs/2601.16238v1
- Date: Wed, 21 Jan 2026 19:29:00 GMT
- Title: VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
- Authors: Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, Humphrey Shi
- Abstract summary: "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact.
- Score: 42.56489784841984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: VIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.
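To make the plugin mechanism mentioned in the abstract more concrete, the sketch below shows what a stable C ABI for dynamically loaded operator plugins can look like in C++. It is a minimal illustration under assumed conventions: the names (vt_tensor_view, vt_op_fn, vt_plugin_init) and struct layout are hypothetical and are not taken from the VibeTensor repository.

```cpp
// Hypothetical sketch of a C-ABI operator plugin (illustrative names only,
// not the actual VibeTensor interface). The host runtime would dlopen() the
// shared object and call vt_plugin_init() to let the plugin register its ops.
#include <cstdint>

extern "C" {

// Plain-C view of a dense tensor, so the ABI stays stable across compilers.
struct vt_tensor_view {
    void*          data;     // device or host pointer
    const int64_t* sizes;    // shape, length == ndim
    const int64_t* strides;  // element strides, length == ndim
    int32_t        ndim;
    int32_t        dtype;    // enum value agreed on by host and plugin
    int32_t        device;   // e.g. 0 = CPU, 1 = CUDA
};

// Operator entry point: inputs/outputs passed as C views, status as int.
typedef int32_t (*vt_op_fn)(const vt_tensor_view* inputs, int32_t num_inputs,
                            vt_tensor_view* outputs, int32_t num_outputs,
                            void* stream /* opaque cudaStream_t or nullptr */);

// Registration callback supplied by the host at load time.
typedef int32_t (*vt_register_op_fn)(const char* name, vt_op_fn fn);

// The single symbol the host looks up after dlopen().
int32_t vt_plugin_init(vt_register_op_fn register_op);

}  // extern "C"

// Plugin-side example: elementwise float add on CPU, assuming contiguous data.
static int32_t my_add(const vt_tensor_view* in, int32_t nin,
                      vt_tensor_view* out, int32_t nout, void* /*stream*/) {
    if (nin != 2 || nout != 1) return -1;
    const float* a = static_cast<const float*>(in[0].data);
    const float* b = static_cast<const float*>(in[1].data);
    float* c = static_cast<float*>(out[0].data);
    int64_t n = 1;
    for (int32_t d = 0; d < out[0].ndim; ++d) n *= out[0].sizes[d];
    for (int64_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
    return 0;  // success
}

extern "C" int32_t vt_plugin_init(vt_register_op_fn register_op) {
    return register_op("my_add", &my_add);
}
```

Keeping the boundary to plain C structs and function pointers avoids C++ name mangling and std:: types crossing the shared-library boundary, which is what makes such an ABI stable across compilers and toolchain versions.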
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent achieves faster-than-torch.compile rates of 100%, 100%, and 92% on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - ATTest: Agent-Driven Tensor Testing for Deep Learning Library Modules [19.355376741404267]
Unit testing of Deep Learning (DL) libraries is challenging due to complex numerical semantics and implicit tensor constraints. This paper proposes ATTest, an agent-driven testing framework for module-level unit test generation.
arXiv Detail & Related papers (2026-02-15T04:47:58Z) - Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control [61.155940786140455]
Reinforcement learning (RL) has shown promising results in active flow control (AFC). Current AFC benchmarks rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support. We introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC.
arXiv Detail & Related papers (2026-01-21T14:13:44Z) - Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity [2.7389338551082605]
We develop a benchmark to test Large Language Models (LLMs) for anticipating performance bottlenecks. FLOPBench predicts single and double-precision FLOP counts for 577 kernels. Our results position FLOPBench as a focused testbed for developing LLM tooling.
arXiv Detail & Related papers (2025-12-04T01:03:20Z) - Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems [1.2289544895833646]
We present a framework for comparing multi-agent PyTorch optimization systems. We show that exploit-heavy strategies perform best when paired with error-fixing agents. The best implementation achieves an average 2.88x speedup on an H100 GPU.
arXiv Detail & Related papers (2025-11-21T05:37:38Z) - SWE-Bench-CL: Continual Learning for Coding Agents [0.0]
SWE-Bench-CL is a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent's ability to accumulate experience.
arXiv Detail & Related papers (2025-06-13T07:11:14Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as patch submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - Towards a high-performance AI compiler with upstream MLIR [34.89141656581549]
This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance.
We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch.
arXiv Detail & Related papers (2024-04-15T10:35:50Z) - UncertaintyPlayground: A Fast and Simplified Python Library for Uncertainty Estimation [0.0]
UncertaintyPlayground is a Python library built on PyTorch and GPyTorch for uncertainty estimation in supervised learning tasks.
The library offers fast training for Gaussian and multi-modal outcome distributions.
It can visualize the prediction intervals of one or more instances.
arXiv Detail & Related papers (2023-10-23T18:36:54Z)