GPT in Data Science: A Practical Exploration of Model Selection
- URL: http://arxiv.org/abs/2311.11516v1
- Date: Mon, 20 Nov 2023 03:42:24 GMT
- Title: GPT in Data Science: A Practical Exploration of Model Selection
- Authors: Nathalia Nascimento, Cristina Tavares, Paulo Alencar, Donald Cowan
- Abstract summary: This research is committed to advancing our comprehension of AI decision-making processes.
Our efforts are directed towards creating AI systems that are more transparent and comprehensible.
- Score: 0.7646713951724013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing interest in leveraging Large Language Models (LLMs)
for managing structured data and enhancing data science processes. Despite the
potential benefits, this integration poses significant questions regarding
their reliability and decision-making methodologies. It highlights the
importance of various factors in the model selection process, including the
nature of the data, problem type, performance metrics, computational resources,
interpretability vs accuracy, assumptions about data, and ethical
considerations. Our objective is to elucidate and express the factors and
assumptions guiding GPT-4's model selection recommendations. We employ a
variability model to depict these factors and use toy datasets to evaluate both
the model and the implementation of the identified heuristics. By contrasting
these outcomes with heuristics from other platforms, our aim is to determine
the effectiveness and distinctiveness of GPT-4's methodology. This research is
committed to advancing our comprehension of AI decision-making processes,
especially in the realm of model selection within data science. Our efforts are
directed towards creating AI systems that are more transparent and
comprehensible, contributing to a more responsible and efficient practice in
data science.
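To make the kind of model-selection heuristics examined here concrete, the following minimal sketch encodes a few illustrative rules as a decision function over dataset characteristics (size, problem type, interpretability requirement). The factors, thresholds, and candidate models are assumptions for illustration only; they are not the heuristics extracted from GPT-4 in the paper.

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    n_samples: int                 # number of observations
    problem_type: str              # "classification" or "regression"
    needs_interpretability: bool   # stakeholder requirement

def recommend_model(profile: DatasetProfile) -> str:
    """Toy variability-aware heuristic mapping dataset characteristics to a
    candidate model family. Thresholds and model choices are illustrative
    assumptions, not the recommendations reported in the paper."""
    if profile.needs_interpretability:
        # interpretability-first branch: prefer linear/additive models
        return ("logistic_regression"
                if profile.problem_type == "classification"
                else "linear_regression")
    if profile.n_samples < 1_000:
        # small data: favor a lower-variance ensemble
        return "random_forest"
    return "gradient_boosting"

# Example: a small, interpretability-constrained classification task
print(recommend_model(DatasetProfile(500, "classification", True)))
```

Making each factor an explicit input of this kind is one way such heuristics can be compared across platforms, as the paper proposes to do with a variability model.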
Related papers
- Addressing Heterogeneity in Federated Learning: Challenges and Solutions for a Shared Production Environment [1.2499537119440245]
Federated learning (FL) has emerged as a promising approach to training machine learning models across decentralized data sources.
This paper provides a comprehensive overview of data heterogeneity in FL within the context of manufacturing.
We discuss the impact of these types of heterogeneity on model training and review current methodologies for mitigating their adverse effects.
arXiv Detail & Related papers (2024-08-18T17:49:44Z)
- A review of feature selection strategies utilizing graph data structures and knowledge graphs [1.9570926122713395]
Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems.
This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability.
The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.
arXiv Detail & Related papers (2024-06-21T04:50:02Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that provides the necessary reasoning skills for the intended downstream application.
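As a rough illustration of gradient-similarity data selection (a simplified stand-in, not the authors' LESS implementation, which relies on LoRA and optimizer-aware gradient features), the sketch below randomly projects per-example gradients to a low dimension and ranks training examples by cosine similarity to target-task gradients; the array shapes and projection size are assumptions.

```python
import numpy as np

def select_by_gradient_similarity(train_grads, target_grads, k, proj_dim=512, seed=0):
    """Rank training examples by the cosine similarity of their (randomly
    projected) gradient features to target-task gradients and return the
    indices of the top-k examples. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    proj = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)  # low-rank random projection
    t = train_grads @ proj
    v = target_grads @ proj
    t /= np.linalg.norm(t, axis=1, keepdims=True) + 1e-12
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    scores = (t @ v.T).max(axis=1)   # best similarity to any target example
    return np.argsort(-scores)[:k]   # indices of the k highest-scoring examples
```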
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Extending Variability-Aware Model Selection with Bias Detection in Machine Learning Projects [0.7646713951724013]
This paper describes work on extending an adaptive variability-aware model selection method with bias detection in machine learning projects.
The proposed approach aims to advance the state of the art by making explicit factors that influence model selection, particularly those related to bias, as well as their interactions.
arXiv Detail & Related papers (2023-11-23T22:08:29Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study [0.0]
The FAIR Guiding Principles aim to improve the findability, accessibility, interoperability, and reusability of digital content by making it both human and machine actionable.
These principles have not yet been broadly adopted in the domain of machine learning-based program analyses and optimizations for High-Performance Computing.
We design a methodology to make HPC datasets and machine learning models FAIR after investigating existing FAIRness assessment and improvement techniques.
arXiv Detail & Related papers (2022-11-03T18:45:46Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Principles and Practice of Explainable Machine Learning [12.47276164048813]
This report focuses on data-driven methods -- machine learning (ML) and pattern recognition models in particular.
With the increasing prevalence and complexity of these methods, business stakeholders at the very least have a growing number of concerns about the drawbacks of models.
We have undertaken a survey to help industry practitioners understand the field of explainable machine learning better.
arXiv Detail & Related papers (2020-09-18T14:50:27Z)