Related papers: Data-driven Discovery with Large Generative Models

URL: http://arxiv.org/abs/2402.13610v1
Date: Wed, 21 Feb 2024 08:26:43 GMT
Title: Data-driven Discovery with Large Generative Models
Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark
Abstract summary: This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs). We demonstrate how LGMs fulfill several desiderata for an ideal data-driven discovery system. We advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms.
Score: 47.324203863823335
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.

Related papers

Reinforcement Learning-based Feature Generation Algorithm for Scientific Data [6.449769135199048]
Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. This paper proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, multi-agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations exhibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies.
arXiv Detail & Related papers (2025-07-04T11:52:09Z)
Data Heterogeneity Modeling for Trustworthy Machine Learning [25.732841312561586]
Data heterogeneity plays a pivotal role in determining the performance of machine learning (ML) systems. Traditional algorithms often overlook the intrinsic diversity within datasets. We show how a deeper understanding of data diversity can enhance model robustness, fairness, and reliability.
arXiv Detail & Related papers (2025-06-01T11:36:56Z)
Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance. Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws. We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications [61.56610953012228]
The generative diffusion model (GDM) is one of the state-of-the-art families of generative models. GDM demonstrates exceptional capability to learn implicit prior knowledge and robust generalization capabilities. A case study shows GDM's promising potential for facilitating efficient ultra-dimensional channel state information acquisition.
arXiv Detail & Related papers (2024-12-24T08:42:01Z)
Generative Fuzzy System for Sequence Generation [16.20988290308979]
We introduce the fuzzy system, a classical modeling method that combines data and knowledge-driven mechanisms, to generative tasks. We propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S. A series of experimental studies were conducted on 12 datasets, covering three distinct categories of generative tasks.
arXiv Detail & Related papers (2024-11-21T06:03:25Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
Dataset Regeneration for Sequential Recommendation [69.93516846106701]
We propose a data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR. To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets.
arXiv Detail & Related papers (2024-05-28T03:45:34Z)
SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control [59.20038082523832]
We present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We develop a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data.
arXiv Detail & Related papers (2024-03-28T14:07:13Z)
Filling the Missing: Exploring Generative AI for Enhanced Federated Learning over Heterogeneous Mobile Edge Devices [72.61177465035031]
We propose a generative AI-empowered federated learning approach to address these challenges by leveraging the idea of FIlling the MIssing (FIMI) portion of local data. Experiment results demonstrate that FIMI can save up to 50% of the device-side energy to achieve the target global test accuracy.
arXiv Detail & Related papers (2023-10-21T12:07:04Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
How Can Subgroup Discovery Help AIOps? [0.0]
We study how Subgroup Discovery can help AIOps. This project involves both data mining researchers and practitioners from Infologic, a French software editor.
arXiv Detail & Related papers (2021-09-10T14:41:02Z)
INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision] [30.411996388471817]
INODE is an end-to-end data exploration system. We demonstrate it in three significant use cases in the fields of Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics.
arXiv Detail & Related papers (2021-04-09T05:04:04Z)
From Data to Actions in Intelligent Transportation Systems: a Prescription of Functional Requirements for Model Actionability [10.27718355111707]
This work aims to describe how data, coming from diverse ITS sources, can be used to learn and adapt data-driven models for efficiently operating ITS assets, systems and processes. Grounded in this described data modeling pipeline for ITS, we define the characteristics, engineering requisites and intrinsic challenges to its three compounding stages, namely, data fusion, adaptive learning and model evaluation.
arXiv Detail & Related papers (2020-02-06T12:02:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.