Making Machine Learning Datasets and Models FAIR for HPC: A Methodology
and Case Study
- URL: http://arxiv.org/abs/2211.02092v1
- Date: Thu, 3 Nov 2022 18:45:46 GMT
- Title: Making Machine Learning Datasets and Models FAIR for HPC: A Methodology
and Case Study
- Authors: Pei-Hung Lin, Chunhua Liao, Winson Chen, Tristan Vanderbruggen, Murali
Emani, Hailu Xu
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The FAIR Guiding Principles aim to improve the findability, accessibility,
interoperability, and reusability of digital content by making them both human
and machine actionable. However, these principles have not yet been broadly
adopted in the domain of machine learning-based program analyses and
optimizations for High-Performance Computing (HPC). In this paper, we design a
methodology to make HPC datasets and machine learning models FAIR after
investigating existing FAIRness assessment and improvement techniques. Our
methodology includes a comprehensive, quantitative assessment for selected data,
followed by concrete, actionable suggestions to improve FAIRness with respect
to common issues related to persistent identifiers, rich metadata descriptions,
license and provenance information. Moreover, we select a representative
training dataset to evaluate our methodology. The experiment shows that the
methodology can effectively improve the dataset and model's FAIRness from an
initial score of 19.1% to a final score of 83.0%.
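The kind of checklist-based FAIRness scoring the abstract describes can be sketched in a few lines. Note this is a minimal illustration: the metric names and weights below are hypothetical, not the paper's actual rubric, and the before/after item sets are invented for demonstration.

```python
# Hypothetical weighted FAIRness checklist; names and weights are illustrative.
CHECKLIST = {
    "persistent_identifier": 2.0,
    "rich_metadata": 2.0,
    "license_info": 1.0,
    "provenance_info": 1.0,
    "machine_readable_access": 1.0,
}

def fairness_score(satisfied):
    """Return the weighted FAIRness score as a percentage of the total."""
    total = sum(CHECKLIST.values())
    earned = sum(w for name, w in CHECKLIST.items() if name in satisfied)
    return 100.0 * earned / total

# Before improvement: only a license is present.
before = fairness_score({"license_info"})
# After improvement: identifier, metadata, license, and provenance added.
after = fairness_score({"persistent_identifier", "rich_metadata",
                        "license_info", "provenance_info"})
```

Under this scheme, each actionable suggestion (add a persistent identifier, enrich metadata, attach license and provenance) directly raises the percentage, which mirrors the 19.1% → 83.0% improvement reported in the abstract.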
Related papers
- User-centric evaluation of explainability of AI with and for humans: a comprehensive empirical study [5.775094401949666]
This study is situated in the field of Human-Centered Artificial Intelligence (HCAI).
It focuses on the results of a user-centered assessment of commonly used eXplainable Artificial Intelligence (XAI) algorithms.
arXiv Detail & Related papers (2024-10-21T12:32:39Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
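The two steps summarized above (lookahead values from tree search turned into step-level preference pairs, then a DPO update) can be sketched as follows. This is a simplified illustration under stated assumptions: the `step_candidates` structure and pairing rule are hypothetical, and only the standard DPO loss on log-probabilities is shown, not the paper's full training loop.

```python
import math

def step_preferences(step_candidates):
    """For each reasoning step, pair the highest- and lowest-valued
    candidate continuations as (chosen, rejected).

    step_candidates: list of dicts mapping candidate step text -> lookahead value.
    """
    pairs = []
    for candidates in step_candidates:
        best = max(candidates, key=candidates.get)
        worst = min(candidates, key=candidates.get)
        if best != worst:  # need at least two distinct candidates to form a pair
            pairs.append((best, worst))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on one preference pair of log-probabilities:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal policy and reference log-probabilities the margin is zero and the loss reduces to log 2, the usual DPO starting point before the policy separates chosen from rejected steps.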
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- An explainable machine learning-based approach for analyzing customers' online data to identify the importance of product attributes [0.6437284704257459]
We propose a game theory machine learning (ML) method that extracts comprehensive design implications for product development.
We apply our method to a real-world dataset of laptops from Kaggle, and derive design implications based on the results.
arXiv Detail & Related papers (2024-02-03T20:50:48Z)
- GPT in Data Science: A Practical Exploration of Model Selection [0.7646713951724013]
This research is committed to advancing our comprehension of AI decision-making processes.
Our efforts are directed towards creating AI systems that are more transparent and comprehensible.
arXiv Detail & Related papers (2023-11-20T03:42:24Z)
- Latent Properties of Lifelong Learning Systems [59.50307752165016]
We introduce an algorithm-agnostic explainable surrogate-modeling approach to estimate latent properties of lifelong learning algorithms.
We validate the approach for estimating these properties via experiments on synthetic data.
arXiv Detail & Related papers (2022-07-28T20:58:13Z)
- Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation [12.37745209793872]
We introduce a model-agnostic framework for gathering data to evaluate and improve contextual decision making.
Our method is used for the data-efficient evaluation of the regret of past treatment assignments.
arXiv Detail & Related papers (2022-07-12T01:20:11Z)
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- A User's Guide to Calibrating Robotics Simulators [54.85241102329546]
This paper proposes a set of benchmarks and a framework for the study of various algorithms aimed to transfer models and policies learnt in simulation to the real world.
We conduct experiments on a wide range of well known simulated environments to characterize and offer insights into the performance of different algorithms.
Our analysis can be useful for practitioners working in this area and can help make informed choices about the behavior and main properties of sim-to-real algorithms.
arXiv Detail & Related papers (2020-11-17T22:24:26Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.