A Benchmarking Framework for Model Datasets
- URL: http://arxiv.org/abs/2603.05250v1
- Date: Thu, 05 Mar 2026 15:04:35 GMT
- Title: A Benchmarking Framework for Model Datasets
- Authors: Philipp-Lorenz Glaser, Lola Burgueño, Dominik Bork
- Abstract summary: Empirical and LLM-based research in model-driven engineering increasingly relies on datasets of software models. Such datasets are typically collected or created ad hoc and without guarantees of their quality for the specific task for which they are used. We propose a Benchmark Platform for MDE that provides a unified infrastructure for systematically assessing and comparing datasets of software models across languages and formats.
- Score: 1.2234742322758418
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Empirical and LLM-based research in model-driven engineering increasingly relies on datasets of software models, for instance, to train or evaluate machine learning techniques for modeling support. These datasets have a significant impact on solution performance; hence, they should be treated and assessed as first-class artifacts. However, such datasets are typically collected or created ad hoc and without guarantees of their quality for the specific task for which they are used. This limits the comparability of results between studies, obscures dataset quality and representativeness, and leads to weak reproducibility and potential bias. In this work, we propose a benchmarking framework for model datasets (i.e., benchmarking the dataset itself). Benchmarking datasets involves systematically measuring their quality, representativeness, and suitability for specific tasks. To this end, we propose a Benchmark Platform for MDE that provides a unified infrastructure for systematically assessing and comparing datasets of software models across languages and formats, using defined criteria and metrics.
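As a rough illustration of what "benchmarking the dataset itself" can mean in practice, the following is a minimal Python sketch that computes a few dataset-level indicators (model count, exact-duplicate ratio, size dispersion) over serialized software models. The metric names and the `DatasetReport` structure are assumptions for illustration only and do not reflect the proposed platform's actual criteria or API.

```python
# Minimal sketch (not the paper's implementation): coarse quality and
# representativeness indicators for a collection of serialized models.
import hashlib
import statistics
from dataclasses import dataclass


@dataclass
class DatasetReport:
    num_models: int
    duplicate_ratio: float   # share of exact-duplicate serializations
    mean_size_chars: float   # average serialized model size
    size_dispersion: float   # stdev / mean, a rough spread indicator


def benchmark_dataset(models: list[str]) -> DatasetReport:
    """Compute simple quality/representativeness indicators for a model dataset."""
    if not models:
        return DatasetReport(0, 0.0, 0.0, 0.0)
    hashes = [hashlib.sha256(m.encode("utf-8")).hexdigest() for m in models]
    duplicates = len(hashes) - len(set(hashes))
    sizes = [len(m) for m in models]
    mean_size = statistics.fmean(sizes)
    spread = statistics.pstdev(sizes) / mean_size if mean_size else 0.0
    return DatasetReport(
        num_models=len(models),
        duplicate_ratio=duplicates / len(models),
        mean_size_chars=mean_size,
        size_dispersion=spread,
    )


if __name__ == "__main__":
    toy = ["class A {}", "class A {}", "class B extends A { int x; }"]
    print(benchmark_dataset(toy))
```

Real criteria would need to be language-aware (e.g., parsing UML or Ecore models rather than counting characters), but even coarse indicators like these surface duplicates and size skew that would otherwise go unnoticed.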
Related papers
- The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data [25.926467401802046]
Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. We propose a framework for evaluating synthetic data from two dimensions: quality and trustworthiness.
arXiv Detail & Related papers (2026-01-25T06:40:25Z) - OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z) - Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading [3.7723788828505125]
This study investigates the transferability of state-of-the-art (SOTA) models trained on established datasets to an unexplored text dataset. The primary goal of this work is to yield comprehensive insights into the potential applicability and adaptability of SOTA models.
arXiv Detail & Related papers (2025-08-19T05:45:02Z) - Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability [41.23032741638842]
This position paper advocates for the integration of systematic, descriptive-based evaluation metrics into the dataset review process. We introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets.
arXiv Detail & Related papers (2025-06-02T15:31:52Z) - Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results); a minimal sketch of this comparison appears after this list. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
arXiv Detail & Related papers (2025-03-18T15:40:18Z) - Larger or Smaller Reward Margins to Select Preferences for Alignment? [47.11487070429289]
Preference learning is critical for aligning large language models with human values. We introduce the alignment potential metric, which quantifies the gap from the model's current implicit reward margin to the target explicit reward margin. Empirical results demonstrate that training on data selected by this metric consistently enhances alignment performance.
arXiv Detail & Related papers (2025-02-25T06:43:24Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
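To make the absolute-versus-relative distinction from the cross-dataset generalization entry above (2025-03-18) concrete, here is a minimal Python sketch under one plausible reading: the relative drop compares a model's cross-dataset score against its within-dataset baseline. The function name and the drop definition are illustrative assumptions, not code from that paper.

```python
# Illustrative sketch (assumed reading, not the paper's code): "absolute"
# performance is the raw score on a given dataset; "relative" performance is
# the drop versus the same model's within-dataset baseline.
def relative_drop(within_dataset_score: float, cross_dataset_score: float) -> float:
    """Fractional performance drop when moving to an unseen dataset."""
    if within_dataset_score == 0:
        raise ValueError("within-dataset baseline must be non-zero")
    return (within_dataset_score - cross_dataset_score) / within_dataset_score


# Example: a model at 0.80 accuracy in-domain and 0.60 on an unseen dataset
# loses 25% of its baseline performance.
print(relative_drop(0.80, 0.60))  # 0.25
```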
This list is automatically generated from the titles and abstracts of the papers on this site.