Position Paper on Dataset Engineering to Accelerate Science
- URL: http://arxiv.org/abs/2303.05545v1
- Date: Thu, 9 Mar 2023 19:07:40 GMT
- Title: Position Paper on Dataset Engineering to Accelerate Science
- Authors: Emilio Vital Brazil, Eduardo Soares, Lucas Villa Real, Leonardo
Azevedo, Vinicius Segura, Luiz Zerkowski, and Renato Cerqueira
- Abstract summary: In this work, we will use the token \textit{dataset} to designate a structured set of data built to perform a well-defined task.
Specifically, in science, each area has its own ways to organize, gather, and handle its datasets.
We advocate that science and engineering discovery processes are extreme instances of the need for such organization around datasets.
- Score: 1.952708415083428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is a critical element in any discovery process. In the last decades, we
observed exponential growth in the volume of available data and the technology
to manipulate it. However, data is only practical when one can structure it for
a well-defined task. For instance, we need a corpus of text broken into
sentences to train a natural language machine-learning model. In this work, we
will use the token \textit{dataset} to designate a structured set of data built
to perform a well-defined task. Moreover, in most cases the dataset serves as
the blueprint of an entity that can, at any moment, be stored as a table.
Specifically, in science, each area has its own ways of organizing, gathering,
and handling its datasets. We believe that datasets must be a first-class
entity in any knowledge-intensive process, and all workflows should pay
exceptional attention to the dataset lifecycle, from gathering through use and
evolution. We advocate that science and engineering discovery processes are
extreme instances of this need for organization around datasets, calling for
new approaches and tooling. Furthermore, these requirements become more evident when
the discovery workflow uses artificial intelligence methods to empower the
subject-matter expert. In this work, we discuss an approach that treats
datasets as a critical entity in the scientific discovery process. We
illustrate some of these concepts using material discovery as a use case. We
chose this domain because it presents many significant problems that
generalize to other science fields.
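The idea of a dataset as a first-class entity whose lifecycle (gathering, uses, evolution) is tracked explicitly can be sketched as a small data structure. The class and field names below are illustrative assumptions, not the authors' design:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Hypothetical sketch: a dataset as a first-class entity whose lifecycle
# is recorded in a lineage log. Field names are illustrative only.
@dataclass
class Dataset:
    name: str
    task: str                      # the well-defined task the dataset serves
    records: list[dict[str, Any]]  # tabular blueprint: storable as a table
    lineage: list[str] = field(default_factory=list)  # provenance / evolution log

    def log(self, event: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.lineage.append(f"{stamp} {event}")

    def evolve(self, new_records: list[dict[str, Any]], reason: str) -> "Dataset":
        # Evolution yields a new version; the lineage records why.
        child = Dataset(self.name, self.task, self.records + new_records,
                        lineage=list(self.lineage))
        child.log(f"evolved: {reason}")
        return child

# Hypothetical material-discovery usage: gather, then evolve with new data.
base = Dataset("candidate-polymers", "property prediction",
               records=[{"smiles": "CCO", "tg_K": 150.0}])
base.log("gathered from lab measurements")
v2 = base.evolve([{"smiles": "CCN", "tg_K": 180.0}], "added new measurements")
print(len(v2.records), len(v2.lineage))
```

Making lineage part of the entity, rather than an external log, is one way a workflow can give the dataset lifecycle the "exceptional attention" the abstract calls for.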
Related papers
- The Future of Data Science Education [0.11566458078238004]
  The School of Data Science at the University of Virginia has developed a novel model for the definition of Data Science.
  This paper presents the core features of the model and explains how it unifies various concepts, going far beyond the analytics component of AI.
  arXiv Detail & Related papers (2024-07-16T15:11:54Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
  This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
  We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
  arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
  There have been severe concerns over the trustworthiness of AI technologies.
  Machine and deep learning algorithms depend heavily on the data used during their development.
  We propose a framework to evaluate datasets through a responsible rubric.
  arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
  A more robust but time-consuming approach is to build the dataset constructively, with a subject-matter expert handpicking documents.
  Here we showcase a new machine-learning-based tool for constructively generating targeted datasets of scientific literature.
  arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science [4.120803087965204]
  This paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections.
  Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation.
  arXiv Detail & Related papers (2023-03-03T20:31:04Z)
- A Vision for Semantically Enriched Data Science [19.604667287258724]
  Key areas such as utilizing domain knowledge and data semantics have seen little automation.
  We envision how leveraging "semantic" understanding and reasoning on data, in combination with novel tools for data science automation, can help with consistent and explainable data augmentation and transformation.
  arXiv Detail & Related papers (2023-03-02T16:03:12Z)
- Understanding the World Through Action [91.3755431537592]
  I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
  I will discuss how such a procedure is more closely aligned with potential downstream tasks.
  arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets [122.85598648289789]
  We study how multi-domain and multi-task datasets can improve the learning of new tasks in new environments.
  We also find that data for only a few tasks in a new domain can bridge the domain gap, making it possible for a robot to perform a variety of prior tasks that were only seen in other domains.
  arXiv Detail & Related papers (2021-09-27T23:42:12Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
  We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
  Our dataset is collected in both 2D images and 3D point clouds.
  Users are free to import their own object models to generate as much data as they want.
  arXiv Detail & Related papers (2021-04-29T05:31:21Z)
- Latent Feature Representation via Unsupervised Learning for Pattern Discovery in Massive Electron Microscopy Image Volumes [4.278591555984395]
  In particular, we give an unsupervised deep learning approach to learning a latent representation that captures semantic similarity in the data set.
  We demonstrate the utility of our method on nano-scale electron microscopy data, where even relatively small portions of animal brains can require terabytes of image data.
  arXiv Detail & Related papers (2020-12-22T17:14:19Z)
- COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning [78.13740204156858]
  We show that we can reuse prior data to extend new skills simply through dynamic programming.
  We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets to solve a new task.
  We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands.
  arXiv Detail & Related papers (2020-10-27T17:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.