GPT in Data Science: A Practical Exploration of Model Selection
- URL: http://arxiv.org/abs/2311.11516v1
- Date: Mon, 20 Nov 2023 03:42:24 GMT
- Title: GPT in Data Science: A Practical Exploration of Model Selection
- Authors: Nathalia Nascimento, Cristina Tavares, Paulo Alencar, Donald Cowan
- Abstract summary: This research is committed to advancing our comprehension of AI decision-making processes.
Our efforts are directed towards creating AI systems that are more transparent and comprehensible.
- Score: 0.7646713951724013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing interest in leveraging Large Language Models (LLMs)
for managing structured data and enhancing data science processes. Despite the
potential benefits, this integration poses significant questions regarding
their reliability and decision-making methodologies. It highlights the
importance of various factors in the model selection process, including the
nature of the data, problem type, performance metrics, computational resources,
interpretability vs accuracy, assumptions about data, and ethical
considerations. Our objective is to elucidate and express the factors and
assumptions guiding GPT-4's model selection recommendations. We employ a
variability model to depict these factors and use toy datasets to evaluate both
the model and the implementation of the identified heuristics. By contrasting
these outcomes with heuristics from other platforms, our aim is to determine
the effectiveness and distinctiveness of GPT-4's methodology. This research is
committed to advancing our comprehension of AI decision-making processes,
especially in the realm of model selection within data science. Our efforts are
directed towards creating AI systems that are more transparent and
comprehensible, contributing to a more responsible and efficient practice in
data science.
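To make the kind of model-selection heuristics examined here concrete, the following minimal sketch encodes a few illustrative rules as a decision function over dataset characteristics (size, problem type, interpretability requirement). The factors, thresholds, and candidate models are assumptions for illustration only; they are not the heuristics extracted from GPT-4 in the paper.

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    n_samples: int                 # number of observations
    problem_type: str              # "classification" or "regression"
    needs_interpretability: bool   # stakeholder requirement

def recommend_model(profile: DatasetProfile) -> str:
    """Toy variability-aware heuristic mapping dataset characteristics to a
    candidate model family. Thresholds and model choices are illustrative
    assumptions, not the recommendations reported in the paper."""
    if profile.needs_interpretability:
        # interpretability-first branch: prefer linear/additive models
        return ("logistic_regression"
                if profile.problem_type == "classification"
                else "linear_regression")
    if profile.n_samples < 1_000:
        # small data: favor a lower-variance ensemble
        return "random_forest"
    return "gradient_boosting"

# Example: a small, interpretability-constrained classification task
print(recommend_model(DatasetProfile(500, "classification", True)))
```

Making each factor an explicit input of this kind is one way such heuristics can be compared across platforms, as the paper proposes to do with a variability model.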
Related papers
- Addressing Heterogeneity in Federated Learning: Challenges and Solutions for a Shared Production Environment [1.2499537119440245]
Federated learning (FL) has emerged as a promising approach to training machine learning models across decentralized data sources.
This paper provides a comprehensive overview of data heterogeneity in FL within the context of manufacturing.
We discuss the impact of these types of heterogeneity on model training and review current methodologies for mitigating their adverse effects.
arXiv Detail & Related papers (2024-08-18T17:49:44Z)
- A review of feature selection strategies utilizing graph data structures and knowledge graphs [1.9570926122713395]
Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems.
This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability.
The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.
arXiv Detail & Related papers (2024-06-21T04:50:02Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that provides the necessary reasoning skills for the intended downstream application.
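As a rough illustration of gradient-similarity data selection (a simplified stand-in, not the authors' LESS implementation, which relies on LoRA and optimizer-aware gradient features), the sketch below randomly projects per-example gradients to a low dimension and ranks training examples by cosine similarity to target-task gradients; the array shapes and projection size are assumptions.

```python
import numpy as np

def select_by_gradient_similarity(train_grads, target_grads, k, proj_dim=512, seed=0):
    """Rank training examples by the cosine similarity of their (randomly
    projected) gradient features to target-task gradients and return the
    indices of the top-k examples. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    proj = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)  # low-rank random projection
    t = train_grads @ proj
    v = target_grads @ proj
    t /= np.linalg.norm(t, axis=1, keepdims=True) + 1e-12
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    scores = (t @ v.T).max(axis=1)   # best similarity to any target example
    return np.argsort(-scores)[:k]   # indices of the k highest-scoring examples
```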
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Extending Variability-Aware Model Selection with Bias Detection in Machine Learning Projects [0.7646713951724013]
This paper describes work on extending an adaptive variability-aware model selection method with bias detection in machine learning projects.
The proposed approach aims to advance the state of the art by making explicit factors that influence model selection, particularly those related to bias, as well as their interactions.
arXiv Detail & Related papers (2023-11-23T22:08:29Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study [0.0]
The FAIR Guiding Principles aim to improve the findability, accessibility, interoperability, and reusability of digital content by making it both human and machine actionable.
These principles have not yet been broadly adopted in the domain of machine learning-based program analyses and optimizations for High-Performance Computing.
We design a methodology to make HPC datasets and machine learning models FAIR after investigating existing FAIRness assessment and improvement techniques.
arXiv Detail & Related papers (2022-11-03T18:45:46Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Principles and Practice of Explainable Machine Learning [12.47276164048813]
This report focuses on data-driven methods -- machine learning (ML) and pattern recognition models in particular.
With the increasing prevalence and complexity of these methods, business stakeholders at the very least have a growing number of concerns about the drawbacks of models.
We have undertaken a survey to help industry practitioners understand the field of explainable machine learning better.
arXiv Detail & Related papers (2020-09-18T14:50:27Z)