Automatic, Expressive, and Scalable Fuzzing with Stitching
- URL: http://arxiv.org/abs/2602.18689v1
- Date: Sat, 21 Feb 2026 01:48:17 GMT
- Title: Automatic, Expressive, and Scalable Fuzzing with Stitching
- Authors: Harrison Green, Fraser Brown, Claire Le Goues
- Abstract summary: We propose stitching, a technique that encodes API usage constraints in pieces that a fuzzer dynamically assembles at runtime. We implement stitching in STITCH, using LLMs to automatically configure projects for fuzzing, synthesize a specification, triage crashes, and repair the specification itself. We evaluated STITCH against four state-of-the-art tools on 33 benchmarks, where it achieved the highest code coverage on 21 and found 30 true-positive bugs compared to 10 by all other tools combined.
- Score: 12.105597820462634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fuzzing is a powerful technique for finding bugs in software libraries, but scaling it remains difficult. Automated harness generation commits to fixed API sequences at synthesis time, limiting the behaviors each harness can test. Approaches that instead explore new sequences dynamically lack the expressiveness to model real-world usage constraints leading to false positives from straightforward API misuse. We propose stitching, a technique that encodes API usage constraints in pieces that a fuzzer dynamically assembles at runtime. A static type system governs how objects flow between blocks, while a dynamically-checked extrinsic typestate tracks arbitrary metadata across blocks, enabling specifications to express rich semantic constraints such as object state dependencies and cross-function preconditions. This allows a single specification to describe an open-ended space of valid API interactions that the fuzzer explores guided by coverage feedback. We implement stitching in STITCH, using LLMs to automatically configure projects for fuzzing, synthesize a specification, triage crashes, and repair the specification itself. We evaluated STITCH against four state-of-the-art tools on 33 benchmarks, where it achieved the highest code coverage on 21 and found 30 true-positive bugs compared to 10 by all other tools combined, with substantially higher precision (70% vs. 12% for the next-best LLM-based tool). Deployed automatically on 1365 widely used open-source projects, STITCH discovered 131 new bugs across 102 projects, 73 of which have already been patched.
Related papers
- LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer [15.118579443741659] (2026-02-26T14:53:26Z)
  We build on the observation that historical bug reports contain rich, underutilized information about silent bugs. We leverage large language models (LLMs) to perform versatile yet controlled bug transfer for silent bug fuzzing. This enables proactive detection of silent bugs by transferring high-risk contexts and oracle designs from known buggy to functionally similar target.
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797] (2026-02-10T06:58:26Z)
  Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
- One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582] (2025-11-05T14:39:59Z)
  Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
- May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs [20.03968975178177] (2025-06-21T08:51:53Z)
  Fuzz testing (fuzzing) is a simple yet effective way to find bugs in Deep Learning (DL) frameworks. We propose FUEL to effectively utilize the feedback information, which comprises two Large Language Models (LLMs): analysis LLM and generation LLM. We show that FUEL can improve line code coverage of PyTorch and execution by 9.15% and 14.70% over state-of-the-art baselines.
- SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137] (2025-05-29T18:28:02Z)
  We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072] (2025-02-12T21:42:56Z)
  This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695] (2025-01-08T07:07:22Z)
  Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like overflows and use buffer-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
- Subgraph-Oriented Testing for Deep Learning Libraries [9.78188667672054] (2024-12-09T12:10:48Z)
  We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects. SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
- FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage [2.1626093085892144] (2024-10-23T10:06:08Z)
  We create an automated hardware fuzzing framework called FuzzWiz. It includes parsing the RTL design module, converting it into C/C++ models, creating a generic testbench with assertions, linking, and fuzzing. Our benchmarking results show that we could achieve around 90% of the coverage 10 times faster than the traditional simulation regression based approach.
- AutoBencher: Towards Declarative Benchmark Construction [74.54640925146289] (2024-07-11T10:03:47Z)
  We use AutoBencher to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories of knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks.
- LineBreaker: Finding Token-Inconsistency Bugs with Large Language Models [37.995370535587575] (2024-05-02T18:44:34Z)
  Token-inconsistency bugs (TIBs) involve the misuse of syntactically valid yet incorrect code tokens. Traditional detection methods, such as static analysis and dynamic testing, often struggle with TIBs due to their versatile and context-dependent nature. We introduce LineBreaker, a novel and cascaded TIB detection system.