A Case for Dataset Specific Profiling
- URL: http://arxiv.org/abs/2208.03315v1
- Date: Mon, 1 Aug 2022 18:38:05 GMT
- Title: A Case for Dataset Specific Profiling
- Authors: Seth Ockerman, John Wu, Christopher Stewart
- Abstract summary: Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
- Score: 0.9023847175654603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven science is an emerging paradigm where scientific discoveries
depend on the execution of computational AI models against rich,
discipline-specific datasets. With modern machine learning frameworks, anyone
can develop and execute computational models that reveal concepts hidden in the
data that could enable scientific applications. For important and widely used
datasets, computing the performance of every computational model that can run
against a dataset is cost prohibitive in terms of cloud resources. Benchmarking
approaches used in practice use representative datasets to infer performance
without actually executing models. While practicable, these approaches limit
extensive dataset profiling to a few datasets and introduce bias that favors
models suited for representative datasets. As a result, each dataset's unique
characteristics are left unexplored and subpar models are selected based on
inference from generalized datasets. This necessitates a new paradigm that
introduces dataset profiling into the model selection process. To demonstrate
the need for dataset-specific profiling, we answer two questions: (1) Can
scientific datasets significantly permute the rank order of computational
models compared to widely used representative datasets? (2) If so, could
lightweight model execution improve benchmarking accuracy? Taken together, the
answers to these questions lay the foundation for a new dataset-aware
benchmarking paradigm.
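Question (1) amounts to checking whether the model ranking measured on a representative benchmark survives contact with a discipline-specific dataset. A minimal sketch of that check, assuming per-model accuracy scores have already been measured (the model names and scores below are hypothetical), compares the two rankings with standard rank-correlation statistics:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical accuracies measured on a widely used representative benchmark
# and on a discipline-specific target dataset (values are illustrative).
representative = {"resnet50": 0.91, "vit_b16": 0.89, "efficientnet": 0.87, "mobilenet": 0.80}
target         = {"resnet50": 0.74, "vit_b16": 0.83, "efficientnet": 0.85, "mobilenet": 0.70}

models = sorted(representative)
rep_scores = [representative[m] for m in models]
tgt_scores = [target[m] for m in models]

tau, _ = kendalltau(rep_scores, tgt_scores)   # rank-order agreement
rho, _ = spearmanr(rep_scores, tgt_scores)
print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")
# Low or negative correlation means the target dataset permutes the model
# ranking, i.e. the representative benchmark is a poor proxy for selection.
```

Question (2) then asks whether briefly executing each candidate model on a small sample of the target dataset (lightweight model execution) can recover the target ranking at a fraction of the cost of full profiling.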
Related papers
- SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation [29.454948190814765]
In recent years, the data collected for artificial intelligence has grown to an unmanageable amount.
We propose a framework to select the most semantically diverse and important dataset portion.
We further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool.
arXiv Detail & Related papers (2024-09-20T19:17:52Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection diminishes.
arXiv Detail & Related papers (2024-07-02T09:54:39Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
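As a rough illustration of the gradient-similarity search described above (not the paper's actual implementation, which is more involved), one can project per-example training gradients to a low-dimensional space and rank them by cosine similarity to the gradient of a few target-task examples; all shapes and the 5% budget below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for flattened per-example training gradients and the mean
# gradient of a handful of target-task examples (shapes are illustrative).
n_train, dim, k = 2_000, 10_000, 128
train_grads = rng.standard_normal((n_train, dim)).astype(np.float32)
target_grad = rng.standard_normal(dim).astype(np.float32)

# Low-rank random projection keeps the similarity search cheap while
# approximately preserving inner products.
proj = rng.standard_normal((dim, k)).astype(np.float32) / np.sqrt(k)
train_feat, target_feat = train_grads @ proj, target_grad @ proj

# Rank training examples by cosine similarity to the target-task gradient
# and keep the top 5%, mirroring the "5% of the data" setting above.
cos = (train_feat @ target_feat) / (
    np.linalg.norm(train_feat, axis=1) * np.linalg.norm(target_feat) + 1e-8)
selected = np.argsort(-cos)[: n_train // 20]
```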
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Revisiting Permutation Symmetry for Merging Models between Different Datasets [3.234560001579257]
We investigate the properties of merging models between different datasets.
We find that the accuracy of the merged model decreases more significantly as the datasets diverge more.
We show that condensed datasets created by dataset condensation can be used as substitutes for the original datasets.
arXiv Detail & Related papers (2023-06-09T03:00:34Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, you are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
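A minimal sketch of the column-wise, automatically configured idea follows (this is not the HyperImpute implementation; the candidate learners, mean initialization, and sweep count are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def iterative_impute(X, sweeps=3):
    """Column-wise iterative imputation: on each sweep, every column with
    missing entries is refit with whichever candidate model scores best
    under cross-validation on the currently observed/imputed values."""
    X = X.copy()
    missing = np.isnan(X)
    X[missing] = np.nanmean(X, axis=0)[np.where(missing)[1]]  # initial mean fill
    for _ in range(sweeps):
        for col in np.unique(np.where(missing)[1]):
            rows_obs, rows_mis = ~missing[:, col], missing[:, col]
            feats = np.delete(X, col, axis=1)
            candidates = [Ridge(), RandomForestRegressor(n_estimators=50)]
            best = max(candidates, key=lambda m: cross_val_score(
                m, feats[rows_obs], X[rows_obs, col], cv=3).mean())
            best.fit(feats[rows_obs], X[rows_obs, col])
            X[rows_mis, col] = best.predict(feats[rows_mis])
    return X
```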
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of and/or create high quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z) - Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation [4.339613097080119]
In low-resource scenarios, artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental.
We compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families.
The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size.
arXiv Detail & Related papers (2022-01-05T22:19:10Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
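To make the item-response-theory idea in the last entry concrete: a two-parameter logistic IRT model assigns each model an ability and each test example a difficulty and a discrimination, and high-discrimination items are the ones that best separate strong from weak models. The sketch below fits such a model by plain gradient ascent on a 0/1 correctness matrix; it is illustrative only, not the paper's fitting procedure, and the response matrix is random placeholder data:

```python
import numpy as np

def fit_irt(responses, steps=2000, lr=0.05):
    """Fit a two-parameter-logistic IRT model to a (models x items) 0/1
    correctness matrix by gradient ascent on the Bernoulli log-likelihood."""
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)   # model ability
    b = np.zeros(n_items)        # item (test example) difficulty
    a = np.ones(n_items)         # item discrimination
    for _ in range(steps):
        logits = a * (theta[:, None] - b)          # (models, items)
        p = 1.0 / (1.0 + np.exp(-logits))
        err = responses - p                        # residual drives all gradients
        theta += lr * (err * a).mean(axis=1)
        b     += lr * (-err * a).mean(axis=0)
        a     += lr * (err * (theta[:, None] - b)).mean(axis=0)
    return theta, b, a

# Placeholder usage: rows are pretrained models, columns are test examples.
rng = np.random.default_rng(0)
resp = (rng.random((18, 100)) < 0.7).astype(float)
ability, difficulty, discrimination = fit_irt(resp)
```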