Related papers: Towards Human-Guided, Data-Centric LLM Co-Pilots

Towards Human-Guided, Data-Centric LLM Co-Pilots

URL: http://arxiv.org/abs/2501.10321v2
Date: Fri, 24 Jan 2025 16:37:57 GMT
Title: Towards Human-Guided, Data-Centric LLM Co-Pilots
Authors: Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar,
Abstract summary: CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots.<n>It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing.<n>We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
Score: 53.35493881390917
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Related papers

Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution.<n>We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge.<n>We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z)
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z)
SoK: Data Minimization in Machine Learning [49.60064304454055]
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task.<n>The relevance of data minimization is particularly pronounced in machine learning (ML) applications.<n>Existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection.<n>This work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization.
arXiv Detail & Related papers (2025-08-14T17:00:13Z)
TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes [25.05627023905607]
We envision a new multi-modal data analytics system based on the Model Context Protocol (MCP)<n>First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes.<n>Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities.
arXiv Detail & Related papers (2025-05-16T14:03:30Z)
TAMO:Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data [33.5606443790794]
Large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration. We propose a tool-assisted LLM agent with multi-modality observation data, namely TAMO, for fine-grained root cause analysis.
arXiv Detail & Related papers (2025-04-29T06:50:48Z)
From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System [49.57258257916805]
Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities. Practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints. We propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques.
arXiv Detail & Related papers (2025-04-21T23:05:47Z)
A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms.
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis [2.763670421921841]
We build a proof-of-concept multi-agent system called ADMA Copilot, which can understand user's intent. ADMA Copilot accomplishes tasks automatically, in which three agents: a LLM based controller, an input formatter and an output formatter collaborate.
arXiv Detail & Related papers (2024-10-31T20:15:14Z)
LAMBDA: A Large Model Based Data Agent [7.240586338370509]
We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system. LAMBDA is designed to address data analysis challenges in complex data-driven applications. It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence.
arXiv Detail & Related papers (2024-07-24T06:26:36Z)
DeepFMEA -- A Scalable Framework Harmonizing Process Expertise and Data-Driven PHM [0.0]
In most industrial settings, data is often limited in quantity, and its quality can be inconsistent. To bridge this gap in practice, successfully industrialized PHM tools rely on the introduction of domain expertise as a prior. DeepFMEA draws inspiration from the Failure Mode and Effects Analysis (FMEA) in its structured approach to the analysis of any technical system.
arXiv Detail & Related papers (2024-05-13T09:41:34Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems [45.05372822216111]
Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected. However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISPDM), often fail due to the disproportionate amount of time needed for understanding and preparing the data. This contribution intends present an integrated approach so that data scientists are able to more quickly and reliably gain insights into the CPPS challenges.
arXiv Detail & Related papers (2023-07-21T15:04:00Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
ChatGPT as your Personal Data Scientist [0.9689893038619583]
This paper introduces a ChatGPT-based conversational data-science framework to act as a "personal data scientist" Our model pivots around four dialogue states: Data visualization, Task Formulation, Prediction Engineering, and Result Summary and Recommendation. In summary, we developed an end-to-end system that not only proves the viability of the novel concept of conversational data science but also underscores the potency of LLMs in solving complex tasks.
arXiv Detail & Related papers (2023-05-23T04:00:16Z)
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization [3.6393183544320236]
Speech recognition has become an important challenge when using deep learning (DL) It requires large-scale training datasets and high computational and storage resources. Deep transfer learning (DTL) has been introduced to overcome these issues.
arXiv Detail & Related papers (2023-04-27T21:08:05Z)
OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System [85.8338446357469]
We introduce OmniForce, a human-centered AutoML system that yields both human-assisted ML and ML-assisted human techniques. We show how OmniForce can put an AutoML system into practice and build adaptive AI in open-environment scenarios.
arXiv Detail & Related papers (2023-03-01T13:35:22Z)
DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems [81.21462458089142]
Data-centric AI is emerging as a unifying paradigm that could enable reliable end-to-end pipelines. We propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations. This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development.
arXiv Detail & Related papers (2022-11-09T17:32:09Z)
Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as MAML or Dif-MAML. We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.