DataRaceBench V1.4.1 and DataRaceBench-ML V0.1: Benchmark Suites for
  Data Race Detection
- URL: http://arxiv.org/abs/2308.08473v1
- Date: Wed, 16 Aug 2023 16:23:13 GMT
- Title: DataRaceBench V1.4.1 and DataRaceBench-ML V0.1: Benchmark Suites for
  Data Race Detection
- Authors: Le Chen, Wenhao Wu, Stephen F. Siegel, Pei-Hung Lin, Chunhua Liao
- Abstract summary: Data races pose a significant threat in multi-threaded parallel applications due to their negative impact on program correctness.
DataRaceBench, an open-source benchmark suite, is crafted to assess data race detection tools in a systematic and measurable manner.
This paper introduces a derived dataset named DataRaceBench-ML (DRB-ML).
- Score: 23.240375422302666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data races pose a significant threat in multi-threaded parallel applications
due to their negative impact on program correctness. DataRaceBench, an
open-source benchmark suite, is specifically crafted to assess data race
detection tools in a systematic and measurable manner. Machine learning
techniques have recently demonstrated considerable potential in
high-performance computing (HPC) program analysis and optimization. However,
these techniques require specialized data formats for training and refinement.
This paper presents the latest update to DataRaceBench, incorporating new data
race contributions from Wu et al. \cite{wu2023model}, and introduces a derived
dataset named DataRaceBench-ML (DRB-ML) \cite{drbml}. DRB-ML aligns with the
emerging trend of machine learning and large language models. Originating from
DataRaceBench, this dataset includes detailed labels that denote the presence
of a data race and provides comprehensive details of associated variables, such
as variable names, line numbers, and the operation (read/write). Unique to
DRB-ML, we have also integrated a series of tailored prompt-response pairs
specifically designed for LLM fine-tuning.
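
As a rough illustration of the kind of record the abstract describes (a race/no-race label, the variables involved with their names, line numbers, and read/write operations, plus a derived prompt-response pair), here is a minimal Python sketch. The class names, field names, benchmark identifier, and prompt wording are all hypothetical and are not taken from the actual DRB-ML schema; the embedded OpenMP loop is an illustrative example, not a DataRaceBench kernel.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class VarAccess:
    # Hypothetical field names; the real DRB-ML schema may differ.
    name: str        # variable (or array element) involved in the race
    line: int        # line number within the code snippet
    operation: str   # "read" or "write"


@dataclass
class RaceLabel:
    benchmark_id: str                 # hypothetical identifier
    data_race: bool                   # True if the snippet contains a race
    accesses: List[VarAccess] = field(default_factory=list)


def to_prompt_response(label: RaceLabel, code: str) -> dict:
    """Build one prompt-response pair from a label (assumed format)."""
    prompt = "Does the following OpenMP code contain a data race?\n\n" + code
    if label.data_race:
        details = "; ".join(
            f"{a.operation} of '{a.name}' at line {a.line}"
            for a in label.accesses
        )
        response = f"Yes. Conflicting accesses: {details}."
    else:
        response = "No."
    return {"prompt": prompt, "response": response}


# Illustrative snippet: iteration i reads a[i + 1] while another
# iteration may be writing it, a loop-carried anti-dependence.
code = (
    "#pragma omp parallel for\n"
    "for (i = 0; i < n - 1; i++)\n"
    "  a[i] = a[i + 1] + 1;"
)

label = RaceLabel(
    benchmark_id="hypothetical-001",
    data_race=True,
    accesses=[
        VarAccess(name="a[i]", line=3, operation="write"),
        VarAccess(name="a[i+1]", line=3, operation="read"),
    ],
)

pair = to_prompt_response(label, code)
print(json.dumps(asdict(label), indent=2))
print(pair["response"])
```

Keeping the structured label separate from the generated prompt-response text means the same record could serve both a classifier-style model and LLM fine-tuning, which is one plausible reading of why the dataset carries both.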

Related papers
- Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
 CatShift is a label-only dataset-inference framework.
It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
 arXiv  Detail & Related papers  (2025-06-06T13:02:59Z)
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
 We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting.
We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
 arXiv  Detail & Related papers  (2025-04-17T17:58:13Z)
- MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
 We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning.
We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance.
Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
 arXiv  Detail & Related papers  (2025-03-26T12:42:37Z)
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
 Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
 arXiv  Detail & Related papers  (2025-02-22T08:53:39Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
 We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
 arXiv  Detail & Related papers  (2024-09-03T11:09:44Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
 Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
 arXiv  Detail & Related papers  (2024-02-26T07:22:51Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
 In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
 arXiv  Detail & Related papers  (2024-02-21T02:45:46Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
 We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
 arXiv  Detail & Related papers  (2023-08-25T01:41:04Z)
- Data Race Detection Using Large Language Models [1.0013600887991827]
 Large language models (LLMs) are an alternative strategy to facilitate analyses and optimizations of high-performance computing programs.
In this paper, we explore a novel LLM-based data race detection approach combining prompting engineering and fine-tuning techniques.
 arXiv  Detail & Related papers  (2023-08-15T00:08:43Z)
- DataAssist: A Machine Learning Approach to Data Cleaning and Preparation [0.0]
 DataAssist is an automated data preparation and cleaning platform that enhances dataset quality using ML-informed methods.
Our tool is applicable to a variety of fields, including economics, business, and forecasting applications, saving over 50% of the time spent on data cleansing and preparation.
 arXiv  Detail & Related papers  (2023-07-14T01:50:53Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
 GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
 arXiv  Detail & Related papers  (2023-03-29T17:03:21Z)
- Integrating Transformer and Autoencoder Techniques with Spectral Graph
  Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
 This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge.
Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder.
The proposed models are validated using five benchmark data sets.
 arXiv  Detail & Related papers  (2022-11-12T22:45:32Z)
- A domain-specific language for describing machine learning dataset [3.9576015470370893]
 This DSL describes datasets in terms of their structure, data provenance, and social concerns.
It is implemented as a Visual Studio Code plugin, and it has been published under an open source license.
 arXiv  Detail & Related papers  (2022-07-05T14:00:01Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
 We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
 arXiv  Detail & Related papers  (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.