Data-driven Discovery with Large Generative Models
- URL: http://arxiv.org/abs/2402.13610v1
- Date: Wed, 21 Feb 2024 08:26:43 GMT
- Title: Data-driven Discovery with Large Generative Models
- Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita
Hazra, Ashish Sabharwal, Peter Clark
- Abstract summary: This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs)
We demonstrate how LGMs fulfill several desideratas for an ideal data-driven discovery system.
We advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms.
- Score: 47.324203863823335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel
scientific discovery is growing exponentially. This position paper urges the
Machine Learning (ML) community to exploit the capabilities of large generative
models (LGMs) to develop automated systems for end-to-end data-driven discovery
-- a paradigm encompassing the search and verification of hypotheses purely
from a set of provided datasets, without the need for additional data
collection or physical experiments. We first outline several desiderata for an
ideal data-driven discovery system. Then, through DATAVOYAGER, a
proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of
these desiderata -- a feat previously unattainable -- while also highlighting
important limitations in the current system that open up opportunities for
novel ML research. We contend that achieving accurate, reliable, and robust
end-to-end discovery systems solely through the current capabilities of LGMs is
challenging. We instead advocate for fail-proof tool integration, along with
active user moderation through feedback mechanisms, to foster data-driven
scientific discoveries with efficiency and reproducibility.
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control [59.20038082523832]
We present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications.
We develop a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data.
arXiv Detail & Related papers (2024-03-28T14:07:13Z) - Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future [130.87142103774752]
This review systematically assesses over seventy open-source autonomous driving datasets.
It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets.
It also delves into the scientific and technical challenges that warrant resolution.
arXiv Detail & Related papers (2023-12-06T10:46:53Z) - Filling the Missing: Exploring Generative AI for Enhanced Federated
Learning over Heterogeneous Mobile Edge Devices [72.61177465035031]
We propose a generative AI-empowered federated learning to address these challenges by leveraging the idea of FIlling the MIssing (FIMI) portion of local data.
Experiment results demonstrate that FIMI can save up to 50% of the device-side energy to achieve the target global test accuracy.
arXiv Detail & Related papers (2023-10-21T12:07:04Z) - A spectrum of physics-informed Gaussian processes for regression in
engineering [0.0]
Despite the growing availability of sensing and data in general, we remain unable to fully characterise many in-service engineering systems and structures from a purely data-driven approach.
This paper pursues the combination of machine learning technology and physics-based reasoning to enhance our ability to make predictive models with limited data.
arXiv Detail & Related papers (2023-09-19T14:39:03Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - How Can Subgroup Discovery Help AIOps? [0.0]
We study how Subgroup Discovery can help AIOps.
This project involves both data mining researchers and practitioners from Infologic, a French software editor.
arXiv Detail & Related papers (2021-09-10T14:41:02Z) - INODE: Building an End-to-End Data Exploration System in Practice
[Extended Vision] [30.411996388471817]
INODE is an end-to-end data exploration system.
We demonstrate it in three significant use cases in the fields of Cancer Biomarker Reearch, Research and Innovation Policy Making, and Astrophysics.
arXiv Detail & Related papers (2021-04-09T05:04:04Z) - From Data to Actions in Intelligent Transportation Systems: a
Prescription of Functional Requirements for Model Actionability [10.27718355111707]
This work aims to describe how data, coming from diverse ITS sources, can be used to learn and adapt data-driven models for efficiently operating ITS assets, systems and processes.
Grounded in this described data modeling pipeline for ITS, wedefine the characteristics, engineering requisites and intrinsic challenges to its three compounding stages, namely, data fusion, adaptive learning and model evaluation.
arXiv Detail & Related papers (2020-02-06T12:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.