A Case for Dataset Specific Profiling
- URL: http://arxiv.org/abs/2208.03315v1
- Date: Mon, 1 Aug 2022 18:38:05 GMT
- Title: A Case for Dataset Specific Profiling
- Authors: Seth Ockerman, John Wu, Christopher Stewart
- Abstract summary: Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
- Score: 0.9023847175654603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven science is an emerging paradigm where scientific discoveries
depend on the execution of computational AI models against rich,
discipline-specific datasets. With modern machine learning frameworks, anyone
can develop and execute computational models that reveal concepts hidden in the
data that could enable scientific applications. For important and widely used
datasets, computing the performance of every computational model that can run
against a dataset is cost prohibitive in terms of cloud resources. Benchmarking
approaches used in practice use representative datasets to infer performance
without actually executing models. While practicable, these approaches limit
extensive dataset profiling to a few datasets and introduce bias that favors
models suited for representative datasets. As a result, each dataset's unique
characteristics are left unexplored and subpar models are selected based on
inference from generalized datasets. This necessitates a new paradigm that
introduces dataset profiling into the model selection process. To demonstrate
the need for dataset-specific profiling, we answer two questions: (1) Can
scientific datasets significantly permute the rank order of computational
models compared to widely used representative datasets? (2) If so, could
lightweight model execution improve benchmarking accuracy? Taken together, the
answers to these questions lay the foundation for a new dataset-aware
benchmarking paradigm.
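Question (1) amounts to checking whether the model ranking measured on a representative benchmark survives contact with a discipline-specific dataset. A minimal sketch of that check, assuming per-model accuracy scores have already been measured (the model names and scores below are hypothetical), compares the two rankings with standard rank-correlation statistics:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical accuracies measured on a widely used representative benchmark
# and on a discipline-specific target dataset (values are illustrative).
representative = {"resnet50": 0.91, "vit_b16": 0.89, "efficientnet": 0.87, "mobilenet": 0.80}
target         = {"resnet50": 0.74, "vit_b16": 0.83, "efficientnet": 0.85, "mobilenet": 0.70}

models = sorted(representative)
rep_scores = [representative[m] for m in models]
tgt_scores = [target[m] for m in models]

tau, _ = kendalltau(rep_scores, tgt_scores)   # rank-order agreement
rho, _ = spearmanr(rep_scores, tgt_scores)
print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")
# Low or negative correlation means the target dataset permutes the model
# ranking, i.e. the representative benchmark is a poor proxy for selection.
```

Question (2) then asks whether briefly executing each candidate model on a small sample of the target dataset (lightweight model execution) can recover the target ranking at a fraction of the cost of full profiling.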
Related papers
- SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation [29.454948190814765]
In recent years, the data collected for artificial intelligence has grown to an unmanageable amount.
We propose a framework to select the most semantically diverse and important dataset portion.
We further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool.
arXiv Detail & Related papers (2024-09-20T19:17:52Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection diminishes.
arXiv Detail & Related papers (2024-07-02T09:54:39Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
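As a rough illustration of the gradient-similarity search described above (not the paper's actual implementation, which is more involved), one can project per-example training gradients to a low-dimensional space and rank them by cosine similarity to the gradient of a few target-task examples; all shapes and the 5% budget below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for flattened per-example training gradients and the mean
# gradient of a handful of target-task examples (shapes are illustrative).
n_train, dim, k = 2_000, 10_000, 128
train_grads = rng.standard_normal((n_train, dim)).astype(np.float32)
target_grad = rng.standard_normal(dim).astype(np.float32)

# Low-rank random projection keeps the similarity search cheap while
# approximately preserving inner products.
proj = rng.standard_normal((dim, k)).astype(np.float32) / np.sqrt(k)
train_feat, target_feat = train_grads @ proj, target_grad @ proj

# Rank training examples by cosine similarity to the target-task gradient
# and keep the top 5%, mirroring the "5% of the data" setting above.
cos = (train_feat @ target_feat) / (
    np.linalg.norm(train_feat, axis=1) * np.linalg.norm(target_feat) + 1e-8)
selected = np.argsort(-cos)[: n_train // 20]
```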
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Revisiting Permutation Symmetry for Merging Models between Different Datasets [3.234560001579257]
We investigate the properties of merging models between different datasets.
We find that the accuracy of the merged model decreases more significantly as the datasets diverge more.
We show that condensed datasets created by dataset condensation can be used as substitutes for the original datasets.
arXiv Detail & Related papers (2023-06-09T03:00:34Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, you are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
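A minimal sketch of the column-wise, automatically configured idea follows (this is not the HyperImpute implementation; the candidate learners, mean initialization, and sweep count are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def iterative_impute(X, sweeps=3):
    """Column-wise iterative imputation: on each sweep, every column with
    missing entries is refit with whichever candidate model scores best
    under cross-validation on the currently observed/imputed values."""
    X = X.copy()
    missing = np.isnan(X)
    X[missing] = np.nanmean(X, axis=0)[np.where(missing)[1]]  # initial mean fill
    for _ in range(sweeps):
        for col in np.unique(np.where(missing)[1]):
            rows_obs, rows_mis = ~missing[:, col], missing[:, col]
            feats = np.delete(X, col, axis=1)
            candidates = [Ridge(), RandomForestRegressor(n_estimators=50)]
            best = max(candidates, key=lambda m: cross_val_score(
                m, feats[rows_obs], X[rows_obs, col], cv=3).mean())
            best.fit(feats[rows_obs], X[rows_obs, col])
            X[rows_mis, col] = best.predict(feats[rows_mis])
    return X
```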
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of and/or create high quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z) - Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation [4.339613097080119]
In low-resource scenarios, artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental.
We compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families.
The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size.
arXiv Detail & Related papers (2022-01-05T22:19:10Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
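To make the item-response-theory idea in the last entry concrete: a two-parameter logistic IRT model assigns each model an ability and each test example a difficulty and a discrimination, and high-discrimination items are the ones that best separate strong from weak models. The sketch below fits such a model by plain gradient ascent on a 0/1 correctness matrix; it is illustrative only, not the paper's fitting procedure, and the response matrix is random placeholder data:

```python
import numpy as np

def fit_irt(responses, steps=2000, lr=0.05):
    """Fit a two-parameter-logistic IRT model to a (models x items) 0/1
    correctness matrix by gradient ascent on the Bernoulli log-likelihood."""
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)   # model ability
    b = np.zeros(n_items)        # item (test example) difficulty
    a = np.ones(n_items)         # item discrimination
    for _ in range(steps):
        logits = a * (theta[:, None] - b)          # (models, items)
        p = 1.0 / (1.0 + np.exp(-logits))
        err = responses - p                        # residual drives all gradients
        theta += lr * (err * a).mean(axis=1)
        b     += lr * (-err * a).mean(axis=0)
        a     += lr * (err * (theta[:, None] - b)).mean(axis=0)
    return theta, b, a

# Placeholder usage: rows are pretrained models, columns are test examples.
rng = np.random.default_rng(0)
resp = (rng.random((18, 100)) < 0.7).astype(float)
ability, difficulty, discrimination = fit_irt(resp)
```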