A FAIR and AI-ready Higgs Boson Decay Dataset
- URL: http://arxiv.org/abs/2108.02214v1
- Date: Wed, 4 Aug 2021 18:00:03 GMT
- Title: A FAIR and AI-ready Higgs Boson Decay Dataset
- Authors: Yifan Chen, E. A. Huerta, Javier Duarte, Philip Harris, Daniel S.
Katz, Mark S. Neubauer, Daniel Diaz, Farouk Mokhtar, Raghav Kansal, Sang Eon
Park, Volodymyr V. Kindratenko, Zhizhen Zhao and Roger Rusack
- Abstract summary: This article provides a step-by-step assessment guide to evaluate whether a given dataset meets each FAIR principle.
We then demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider.
This study marks the first in a planned series of articles that will guide scientists in the creation and quantification of FAIRness in high energy particle physics datasets and AI models.
- Score: 15.325110053200305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enable the reusability of massive scientific datasets by humans and
machines, researchers aim to create scientific datasets that adhere to the
principles of findability, accessibility, interoperability, and reusability
(FAIR) for data and artificial intelligence (AI) models. This article provides
a domain-agnostic, step-by-step assessment guide to evaluate whether or not a
given dataset meets each FAIR principle. We then demonstrate how to use this
guide to evaluate the FAIRness of an open simulated dataset produced by the CMS
Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs
boson decays and quark and gluon background, and is available through the CERN
Open Data Portal. We also use other available tools to assess the FAIRness of
this dataset, and incorporate feedback from members of the FAIR community to
validate our results. This article is accompanied by a Jupyter notebook to
facilitate an understanding and exploration of the dataset, including
visualization of its elements. This study marks the first in a planned series
of articles that will guide scientists in the creation and quantification of
FAIRness in high energy particle physics datasets and AI models.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model [0.0]
We show that citing datasets is not a common or standard practice in spite of recent efforts by data repositories and funding agencies.
A potential solution to this problem is to automatically extract dataset mentions from scientific articles.
In this work, we propose to achieve such extraction by using a neural network based on a Bi-LSTM-CRF architecture.
arXiv Detail & Related papers (2024-05-21T18:12:37Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Multimodal Dataset from Harsh Sub-Terranean Environment with Aerosol Particles for Frontier Exploration [55.41644538483948]
This paper introduces a multimodal dataset from a harsh and unstructured underground environment with aerosol particles.
It contains synchronized raw data measurements from all onboard sensors in Robot Operating System (ROS) format.
The focus of this paper is not only to capture both temporal and spatial data diversities but also to present the impact of harsh conditions on captured data.
arXiv Detail & Related papers (2023-04-27T20:21:18Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset with 91.8% F1-score and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
- FAIR AI Models in High Energy Physics [16.744801048170732]
We propose a practical definition of FAIR principles for AI models in experimental high energy physics.
We describe a template for the application of these principles.
We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.
arXiv Detail & Related papers (2022-12-09T19:00:18Z)
- FAIR principles for AI models, with a practical application for accelerated high energy diffraction microscopy [1.9270896986812693]
We showcase how to create and share FAIR data and AI models within a unified computational framework.
We describe how this domain-agnostic computational framework may be harnessed to enable autonomous AI-driven discovery.
arXiv Detail & Related papers (2022-07-01T18:11:12Z)
- Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research [0.0]
We establish a dataset of over $1.6\times10^{4}$ experimental images of Bose-Einstein condensates containing solitonic excitations.
About 33% of this dataset has manually assigned and carefully curated labels.
The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework.
arXiv Detail & Related papers (2022-05-17T09:53:16Z)
- Paradigm selection for Data Fusion of SAR and Multispectral Sentinel data applied to Land-Cover Classification [63.072664304695465]
In this letter, four data fusion paradigms, based on Convolutional Neural Networks (CNNs) are analyzed and implemented.
The goal is to provide a systematic procedure for choosing the data fusion framework that yields the best classification results.
The procedure has been validated for land-cover classification but it can be transferred to other cases.
arXiv Detail & Related papers (2021-06-18T11:36:54Z)
- First Full-Event Reconstruction from Imaging Atmospheric Cherenkov Telescope Real Data with Deep Learning [55.41644538483948]
The Cherenkov Telescope Array is the future of ground-based gamma-ray astronomy.
Its first prototype telescope built on-site, the Large Size Telescope 1, is currently under commissioning and taking its first scientific data.
We present for the first time the development of a full-event reconstruction based on deep convolutional neural networks and its application to real data.
arXiv Detail & Related papers (2021-05-31T12:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.