METAM: Goal-Oriented Data Discovery
- URL: http://arxiv.org/abs/2304.09068v1
- Date: Tue, 18 Apr 2023 15:42:25 GMT
- Title: METAM: Goal-Oriented Data Discovery
- Authors: Sainyam Galhotra and Yue Gong and Raul Castro Fernandez
- Abstract summary: METAM is a goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process.
We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks.
- Score: 9.73435089036831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is a central component of machine learning and causal inference tasks.
The availability of large amounts of data from sources such as open data
repositories, data lakes and data marketplaces creates an opportunity to
augment data and boost those tasks' performance. However, augmentation
techniques rely on a user manually discovering and shortlisting useful
candidate augmentations. Existing solutions do not leverage the synergy between
discovery and augmentation, thus underexploiting data.
In this paper, we introduce METAM, a novel goal-oriented framework that
queries the downstream task with a candidate dataset, forming a feedback loop
that automatically steers the discovery and augmentation process. To select
candidates efficiently, METAM leverages properties of the: i) data, ii) utility
function, and iii) solution set size. We show METAM's theoretical guarantees
and demonstrate those empirically on a broad set of tasks. All in all, we
demonstrate the promise of goal-oriented data discovery to modern data science
applications.
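The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not METAM's algorithm: it is a plain greedy loop that queries a task-supplied utility function once per remaining candidate and stops when a goal is met. The function name `greedy_goal_oriented_selection`, the `GAINS` table, and `toy_utility` are hypothetical and exist only for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def greedy_goal_oriented_selection(
    candidates: Iterable[str],
    utility: Callable[[FrozenSet[str]], float],
    goal: float,
) -> Tuple[List[str], float]:
    """Greedily add the candidate that most improves the task's utility."""
    remaining = set(candidates)
    selected: List[str] = []
    best = utility(frozenset())  # baseline: task on the unaugmented data
    while remaining and best < goal:
        # Query the downstream task once per remaining candidate.
        scored = [(utility(frozenset(selected + [c])), c) for c in sorted(remaining)]
        score, choice = max(scored)
        if score <= best:  # no candidate helps, so stop early
            break
        selected.append(choice)
        remaining.remove(choice)
        best = score
    return selected, best


# Hypothetical toy utility: each augmentation adds a fixed gain to a 0.60 baseline.
GAINS = {"weather": 0.15, "census": 0.10, "noise": 0.0}


def toy_utility(chosen: FrozenSet[str]) -> float:
    return 0.60 + sum(GAINS[c] for c in chosen)


picked, score = greedy_goal_oriented_selection(GAINS, toy_utility, goal=0.80)
print(picked, round(score, 2))
```

In this toy run the loop selects the two useful augmentations and never adds the uninformative one. A real system would replace `toy_utility` with a call that retrains or re-evaluates the downstream task, which is why reducing the number of such queries matters.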
Related papers
- Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the use of Knowledge Graphs (KG) as a basic framework for capturing complex analytics in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- On the Use of Interpretable Machine Learning for the Management of Data Quality [13.075880857448059]
We propose the use of interpretable machine learning to identify the features on which any data processing activity should be based.
Our aim is to secure data quality, at least, for those features that are detected as significant in the collected datasets.
arXiv Detail & Related papers (2020-07-29T08:49:32Z)
- Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task.
Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator.
To discover the implicit entity preferences of users, we design an elaborate collaborative learning algorithm based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z)