CrashJS: A NodeJS Benchmark for Automated Crash Reproduction
- URL: http://arxiv.org/abs/2405.05541v1
- Date: Thu, 9 May 2024 04:57:10 GMT
- Title: CrashJS: A NodeJS Benchmark for Automated Crash Reproduction
- Authors: Philip Oliver, Jens Dietrich, Craig Anslow, Michael Homer
- Abstract summary: Software bugs often lead to software crashes, which cost US companies upwards of $2.08 trillion annually.
Automated Crash Reproduction aims to generate unit tests that successfully reproduce a crash.
CrashJS is a benchmark dataset of 453 Node.js crashes from several sources.
- Score: 4.3560886861249255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software bugs often lead to software crashes, which cost US companies upwards of $2.08 trillion annually. Automated Crash Reproduction (ACR) aims to generate unit tests that successfully reproduce a crash. The goal of ACR is to aid developers with debugging, providing them with another tool to locate where a bug is in a program. The main approach ACR currently takes is to replicate a stack trace from an error thrown within a program. Currently, ACR has been developed for C, Java, and Python, but there are no tools targeting JavaScript programs. To aid the development of JavaScript ACR tools, we propose CrashJS: a benchmark dataset of 453 Node.js crashes from several sources. CrashJS includes a mix of real-world and synthesised tests, multiple projects, and different levels of complexity for both crashes and target programs.
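To make the stack-trace replication idea concrete, here is a minimal, hypothetical Node.js example (not drawn from the CrashJS dataset; file and function names are illustrative): a function that crashes with a TypeError, followed by the kind of reproducing unit test an ACR tool aims to generate, triggering the same exception from the same frame.

```javascript
// Hypothetical crash: calling formatName({}) throws, e.g.
//   TypeError: Cannot read properties of undefined (reading 'toUpperCase')
//       at formatName (app.js:5:21)
function formatName(user) {
  return user.name.toUpperCase();
}

// ACR-style reproducing test: it triggers the same exception type from the
// same top stack frame, replicating the original crash's stack trace.
const { test } = require('node:test');
const assert = require('node:assert');

test('reproduces TypeError in formatName', () => {
  assert.throws(() => formatName({}), TypeError);
});
```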
Related papers
- Mutation-Based Deep Learning Framework Testing Method in JavaScript Environment [16.67312523556796]
We propose DLJSFuzzer, a mutation-based testing method for deep learning frameworks in the JavaScript environment.
DLJSFuzzer successfully detects 21 unique crashes and 126 unique NaN & Inconsistency bugs.
Compared to all baselines, DLJSFuzzer improves model generation efficiency by over 47% and bug detection efficiency by over 91%.
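The summary above does not describe DLJSFuzzer's mutation operators (which target deep learning models); purely as a generic, assumed sketch of a mutation-based fuzzing loop that flags crashes and NaN results, it might look like this:

```javascript
// Generic mutation-based fuzzing loop (schematic sketch only; DLJSFuzzer's
// actual operators mutate DL models and are not described in this summary).
function fuzz(seeds, mutate, runTarget, iterations = 1000) {
  const findings = [];
  for (let i = 0; i < iterations; i++) {
    const input = mutate(seeds[i % seeds.length]); // derive a new test case
    try {
      const result = runTarget(input);
      if (Number.isNaN(result)) findings.push({ input, kind: 'NaN' });
    } catch (err) {
      findings.push({ input, kind: 'crash', err }); // unique crashes are deduplicated later
    }
  }
  return findings;
}
```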
arXiv Detail & Related papers (2024-09-23T12:37:56Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - Concolic Testing of JavaScript using Sparkplug [6.902028735328818]
In-situ concolic testing for JavaScript is effective but slow and complex.
Our method enhances tracing with the V8 Sparkplug baseline compiler and the remill library for assembly-to-LLVM-IR conversion.
arXiv Detail & Related papers (2024-05-10T22:11:53Z) - CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace [30.48737611250448]
This paper proposes an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace.
We evaluate CrashTranslator on 75 crash reports involving 58 popular Android apps, and it successfully reproduces 61.3% of the crashes.
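CrashTranslator targets Android and its pipeline is not detailed in this summary; as a loose analogue in this document's Node.js setting, the sketch below (all names assumed) parses a V8-style stack trace into structured frames, the kind of input a trace-driven reproduction tool starts from.

```javascript
// Sketch: turn a raw V8-style stack trace ("at fn (file:line:col)") into frames.
function parseStackTrace(stack) {
  const frameRe = /^\s*at\s+(?:(.+?)\s+\()?(.+?):(\d+):(\d+)\)?\s*$/;
  return stack
    .split('\n')
    .map((line) => frameRe.exec(line))
    .filter(Boolean)
    .map(([, fn, file, line, column]) => ({
      fn: fn || '<anonymous>',
      file,
      line: Number(line),
      column: Number(column),
    }));
}

// The first line of `new Error().stack` is the message; the rest are frames.
console.log(parseStackTrace(new Error('boom').stack));
```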
arXiv Detail & Related papers (2023-10-11T02:00:18Z) - RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework.
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
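The summary does not specify RAP-Gen's retriever; as a minimal sketch of the retrieval step under a simplifying assumption (plain token-overlap similarity over stored bug-fix pairs, not RAP-Gen's actual method), it could look like this:

```javascript
// Assumed token-overlap retrieval over previous bug-fix pairs; the retrieved
// pairs would then be passed to the patch generator as fix patterns.
const tokenize = (code) => new Set(code.split(/\W+/).filter(Boolean));

function jaccard(a, b) {
  const inter = [...a].filter((t) => b.has(t)).length;
  return inter / (a.size + b.size - inter || 1);
}

function retrieveFixPatterns(buggyCode, fixCorpus, k = 3) {
  const query = tokenize(buggyCode);
  return fixCorpus // [{ buggy: '...', fixed: '...' }, ...]
    .map((pair) => ({ pair, score: jaccard(query, tokenize(pair.buggy)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.pair);
}
```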
arXiv Detail & Related papers (2023-09-12T08:52:56Z) - Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [51.94361026233668]
We introduce RCACopilot, an on-call system empowered by a large language model for automating root cause analysis of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative.
We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft.
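A schematic sketch of the described pipeline, with all names and data shapes assumed rather than taken from RCACopilot: route an incident to a handler by alert type, aggregate diagnostics, then hand off to a predictor for the root-cause category and narrative.

```javascript
// handlers: alertType -> async (incident) => diagnostic records (assumed shape)
const handlers = new Map();

async function handleIncident(incident, predictRootCause) {
  const handler = handlers.get(incident.alertType);
  if (!handler) throw new Error(`no handler for alert type ${incident.alertType}`);
  const diagnostics = await handler(incident); // aggregate critical runtime info
  return predictRootCause(incident, diagnostics); // category + explanatory narrative
}
```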
arXiv Detail & Related papers (2023-05-25T06:44:50Z) - RunBugRun -- An Executable Dataset for Automated Program Repair [15.670905650869704]
We present a fully executable dataset of 450,000 small buggy/fixed program pairs originally submitted to programming competition websites.
We provide infrastructure to compile, safely execute and test programs as well as fine-grained bug-type labels.
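As an illustration of the "safely execute and test" part, here is a minimal Node.js harness sketch; it assumes the program is already compiled to an executable and reduces "safety" to a timeout and output cap, whereas real infrastructure would need proper sandboxing.

```javascript
const { execFile } = require('node:child_process');

// Run one test case against a compiled program and classify the outcome.
function runTestCase(cmd, args, stdin, expectedStdout, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const child = execFile(
      cmd,
      args,
      { timeout: timeoutMs, maxBuffer: 1024 * 1024 },
      (err, stdout) => {
        if (err && err.killed) return resolve({ verdict: 'timeout' });
        if (err) return resolve({ verdict: 'runtime_error', err });
        const ok = stdout.trim() === expectedStdout.trim();
        resolve({ verdict: ok ? 'pass' : 'wrong_output' });
      }
    );
    child.stdin.end(stdin); // feed the test input
  });
}
```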
arXiv Detail & Related papers (2023-04-03T16:02:00Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
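In use, such a ranker simply reorders sampled programs by predicted correctness so that pass@1 is measured on the top choice; a sketch (with `predictCorrectness` standing in for the learned model, an assumption):

```javascript
// Rank candidate programs best-first by a learned correctness score,
// without executing any of them.
function rankCandidates(candidates, predictCorrectness) {
  return candidates
    .map((program) => ({ program, score: predictCorrectness(program) }))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.program);
}

// pass@1 then depends only on the first-ranked program:
// const passAt1 = runTests(rankCandidates(samples, ranker)[0]);
```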
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact has been a reduction of 55% or more in testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z) - S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It uses a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
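A schematic TensorFlow.js version of such a Siamese model is sketched below; the layer sizes and the way the two encodings are combined are assumptions, since the summary only mentions a biLSTM encoder and a fully-connected classifier.

```javascript
const tf = require('@tensorflow/tfjs');

// Siamese similarity model: a shared biLSTM encoder applied to both stack
// traces, followed by a small fully-connected classifier (sizes assumed).
function buildSimilarityModel(vocabSize, maxFrames) {
  const embed = tf.layers.embedding({ inputDim: vocabSize, outputDim: 50 });
  const biLstm = tf.layers.bidirectional({ layer: tf.layers.lstm({ units: 100 }) });
  const encode = (x) => biLstm.apply(embed.apply(x)); // shared weights

  const a = tf.input({ shape: [maxFrames] }); // tokenised stack trace A
  const b = tf.input({ shape: [maxFrames] }); // tokenised stack trace B
  const merged = tf.layers.concatenate().apply([encode(a), encode(b)]);
  const hidden = tf.layers.dense({ units: 64, activation: 'relu' }).apply(merged);
  const similarity = tf.layers.dense({ units: 1, activation: 'sigmoid' }).apply(hidden);

  return tf.model({ inputs: [a, b], outputs: similarity });
}
```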
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.