DataRaceBench V1.4.1 and DataRaceBench-ML V0.1: Benchmark Suites for
Data Race Detection
- URL: http://arxiv.org/abs/2308.08473v1
- Date: Wed, 16 Aug 2023 16:23:13 GMT
- Title: DataRaceBench V1.4.1 and DataRaceBench-ML V0.1: Benchmark Suites for
Data Race Detection
- Authors: Le Chen, Wenhao Wu, Stephen F. Siegel, Pei-Hung Lin, Chunhua Liao
- Abstract summary: Data races pose a significant threat in multi-threaded parallel applications due to their negative impact on program correctness.
The open-source benchmark suite DataRaceBench is crafted to assess data race detection tools in a systematic and measurable manner.
This paper introduces a derived dataset named DataRaceBench-ML (DRB-ML).
- Score: 23.240375422302666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data races pose a significant threat in multi-threaded parallel applications
due to their negative impact on program correctness. DataRaceBench, an
open-source benchmark suite, is specifically crafted to assess data race
detection tools in a systematic and measurable manner. Machine learning
techniques have recently demonstrated considerable potential in
high-performance computing (HPC) program analysis and optimization. However,
these techniques require specialized data formats for training and refinement.
This paper presents the latest update to DataRaceBench, incorporating new data
race contributions from Wu et al. \cite{wu2023model}, and introduces a derived
dataset named DataRaceBench-ML (DRB-ML) \cite{drbml}. DRB-ML aligns with the
emerging trend of machine learning and large language models. Originating from
DataRaceBench, this dataset includes detailed labels that denote the presence
of a data race and provides comprehensive details of associated variables, such
as variable names, line numbers, and the operation (read/write). Unique to
DRB-ML, we have also integrated a series of tailored prompt-response pairs
specifically designed for LLM fine-tuning.
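To make the DRB-ML description above concrete, here is a minimal sketch of what one labeled record and its fine-tuning pair might look like. The field names, line numbers, and prompt wording are illustrative assumptions for this page, not the published DRB-ML schema; the kernel is modeled on DataRaceBench's anti-dependence examples.
```python
# Hypothetical sketch of a single DRB-ML-style record; field names and line
# numbers are illustrative assumptions, not the published schema.
drb_ml_record = {
    "name": "DRB001-antidep1-orig-yes",  # DataRaceBench-style kernel identifier
    # Minimal OpenMP C kernel with a loop-carried race: iteration i writes
    # a[i] while iteration i-1 concurrently reads a[i].
    "code": "#pragma omp parallel for\n"
            "for (int i = 0; i < len - 1; i++)\n"
            "    a[i] = a[i + 1] + 1;\n",
    "data_race": 1,  # 1 = race present, 0 = race free
    "variables": [
        {"name": "a", "line": 3, "operation": "write"},  # write to a[i]
        {"name": "a", "line": 3, "operation": "read"},   # read of a[i + 1]
    ],
    # Tailored prompt-response pair of the kind DRB-ML adds for LLM fine-tuning.
    "prompt": "Does the following OpenMP code contain a data race? If so, "
              "report the variable name, line number, and operation.",
    "response": "Yes. Array 'a' is written (a[i]) and read (a[i + 1]) at line 3 "
                "by different iterations running concurrently.",
}

# Assemble a supervised fine-tuning example from the record.
fine_tune_example = {
    "input": drb_ml_record["prompt"] + "\n\n" + drb_ml_record["code"],
    "output": drb_ml_record["response"],
}
```
A race-free counterpart (DataRaceBench also ships "-no" kernels) would presumably carry data_race = 0 and an empty variables list.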
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
To target these weaknesses, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - Data Race Detection Using Large Language Models [1.0013600887991827]
Large language models (LLMs) are an alternative strategy to facilitate analyses and optimizations of high-performance computing programs.
In this paper, we explore a novel LLM-based data race detection approach combining prompt engineering and fine-tuning techniques (a rough illustrative sketch of such a prompt-based workflow appears after this list).
arXiv Detail & Related papers (2023-08-15T00:08:43Z) - DataAssist: A Machine Learning Approach to Data Cleaning and Preparation [0.0]
DataAssist is an automated data preparation and cleaning platform that enhances dataset quality using ML-informed methods.
Our tool is applicable to a variety of fields, including economics, business, and forecasting applications, saving over 50% of the time spent on data cleansing and preparation.
arXiv Detail & Related papers (2023-07-14T01:50:53Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge.
Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder.
The proposed models are validated using five benchmark data sets.
arXiv Detail & Related papers (2022-11-12T22:45:32Z) - A domain-specific language for describing machine learning dataset [3.9576015470370893]
This DSL describes datasets in terms of their structure, data provenance, and social concerns.
It is implemented as a Visual Studio Code plugin, and it has been published under an open source license.
arXiv Detail & Related papers (2022-07-05T14:00:01Z) - Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
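The "Data Race Detection Using Large Language Models" entry above mentions combining prompt engineering with fine-tuning. Purely as an illustration of that kind of workflow (not code from either paper), the sketch below wraps a kernel in a detection prompt and maps the model's reply back onto a boolean race label; query_llm is a hypothetical stand-in for whatever (possibly fine-tuned) model endpoint is used.
```python
# Illustrative prompt-based race detection workflow (not from the cited papers).
# `query_llm` is a hypothetical placeholder for a real model call.

def build_prompt(kernel_source: str) -> str:
    """Wrap an OpenMP kernel in a prompt that asks for a fixed-format verdict."""
    return (
        "You are a data race detection tool. Answer with 'RACE: yes' or "
        "'RACE: no', followed by the variable and line number if a race exists.\n\n"
        + kernel_source
    )

def parse_verdict(reply: str) -> bool:
    """Map the model's free-form reply onto a boolean race label."""
    return "race: yes" in reply.lower()

def query_llm(prompt: str) -> str:
    # Placeholder response; a real system would query an LLM here.
    return "RACE: yes (variable 'a', line 3)"

kernel = ("#pragma omp parallel for\n"
          "for (int i = 0; i < len - 1; i++)\n"
          "    a[i] = a[i + 1] + 1;\n")
print(parse_verdict(query_llm(build_prompt(kernel))))  # True for this racy kernel
```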