Data Acquisition: A New Frontier in Data-centric AI
- URL: http://arxiv.org/abs/2311.13712v1
- Date: Wed, 22 Nov 2023 22:15:17 GMT
- Title: Data Acquisition: A New Frontier in Data-centric AI
- Authors: Lingjiao Chen, Bilge Acun, Newsha Ardalani, Yifan Sun, Feiyang Kang,
Hanrui Lyu, Yongchan Kwon, Ruoxi Jia, Carole-Jean Wu, Matei Zaharia and James
Zou
- Abstract summary: We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
- Score: 65.90972015426274
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As Machine Learning (ML) systems continue to grow, the demand for relevant
and comprehensive datasets becomes imperative. There is limited study on the
challenges of data acquisition due to ad-hoc processes and lack of consistent
methodologies. We first present an investigation of current data marketplaces,
revealing a lack of platforms offering detailed information about datasets,
transparent pricing, and standardized data formats. With the objective of inciting
participation from the data-centric AI community, we then introduce the DAM
challenge, a benchmark to model the interaction between the data providers and
acquirers. The benchmark was released as a part of DataPerf. Our evaluation of
the submitted strategies underlines the need for effective data acquisition
strategies in ML.
Related papers
- Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense [21.398890792164703]
We introduce a traffic-oriented data trading platform named Data on The Move (DTM).
DTM integrates traffic simulation, data trading, and Artificial Intelligence (AI) agents.
Our proposed AI agent-based pricing approach enhances data trading by offering rational prices.
arXiv Detail & Related papers (2024-07-01T06:17:18Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a single, specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Data Acquisition via Experimental Design for Decentralized Data Markets [25.300193837833426]
Data markets provide a way to increase the supply of data, particularly in data-scarce domains such as healthcare.
A major challenge for a data buyer in such a market is selecting the most valuable data points from a data seller.
We propose a federated approach to the data selection problem that is inspired by linear experimental design.
arXiv Detail & Related papers (2024-03-20T18:05:52Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- A Marketplace for Trading AI Models based on Blockchain and Incentives for IoT Data [24.847898465750667]
An emerging paradigm in Machine Learning (ML) is a federated approach where the learning model is partially delivered to a group of heterogeneous agents, allowing agents to train the model locally with their own data.
The problem of valuation of models, as well as the questions of incentives for collaborative training and trading of data/models, have received limited treatment in the literature.
In this paper, a new ecosystem of ML model trading over a trusted ML-based network is proposed.
The buyer can acquire the model of interest from the ML market, and interested sellers spend local computations on their data to enhance that model's quality.
arXiv Detail & Related papers (2021-12-06T08:52:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.