Related papers: DMLR: Data-centric Machine Learning Research -- Past, Present and Future

URL: http://arxiv.org/abs/2311.13028v2
Date: Sat, 1 Jun 2024 13:28:30 GMT
Title: DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, Ahmed Alaa, Adji Bousso Dieng, Natasha Noy, Vijay Janapa Reddi, James Zou, Praveen Paritosh, Mihaela van der Schaar, Kurt Bollacker, Lora Aroyo, Ce Zhang, Joaquin Vanschoren, Isabelle Guyon, Peter Mattson
Abstract summary: We outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
Score: 94.06475098911947
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.

Related papers

Diffusion Models for Future Networks and Communications: A Comprehensive Survey [65.97057929688499]
The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option. We aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems.
arXiv Detail & Related papers (2025-08-03T04:59:58Z)
The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes [3.770155074442168]
We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD). WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD.
arXiv Detail & Related papers (2025-02-09T17:09:46Z)
Data clustering: an essential technique in data science [28.124442353352183]
The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, and introduces the workflow of clustering in data science. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.
arXiv Detail & Related papers (2024-12-25T03:14:18Z)
Future of Information Retrieval Research in the Age of Generative AI [61.56371468069577]
In the fast-evolving field of information retrieval (IR), the integration of generative AI technologies such as large language models (LLMs) is transforming how users search for and interact with information. Recognizing this paradigm shift, a visioning workshop was held in July 2024 to discuss the future of IR in the age of generative AI. This report summarizes the discussions as potentially important research topics and lists recommendations for academics, industry practitioners, institutions, evaluation campaigns, and funding agencies.
arXiv Detail & Related papers (2024-12-03T00:01:48Z)
Development of a Web-based Research Consortium Database Management System: Advancing Data-driven and Knowledge-based Project Management [0.3562485774739681]
This paper presents the development of a web-based database and real-time monitoring system for CLAARRDEC. The system is aimed at enhancing data collection, storage, retrieval, and utilization within the consortium. The system's potential extends beyond CLAARRDEC, as it could be utilized by other research consortia in the Philippines.
arXiv Detail & Related papers (2024-11-01T09:55:09Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
Analog and Multi-modal Manufacturing Datasets Acquired on the Future Factories Platform [0.0]
Two industry-grade datasets are presented in this paper. They were collected at the Future Factories Lab at the University of South Carolina on December 11th and 12th of 2023.
arXiv Detail & Related papers (2024-01-28T02:26:58Z)
Understanding LLMs: A Comprehensive Overview from Training to Inference [52.70748499554532]
Low-cost training and deployment of large language models represent the future development trend. The discussion of training covers data preprocessing, training architecture, pre-training tasks, parallel training, and model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization.
arXiv Detail & Related papers (2024-01-04T02:43:57Z)
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future [130.87142103774752]
This review systematically assesses over seventy open-source autonomous driving datasets. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets. It also delves into the scientific and technical challenges that warrant resolution.
arXiv Detail & Related papers (2023-12-06T10:46:53Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
A Roadmap for Greater Public Use of Privacy-Sensitive Government Data: Workshop Report [11.431595898012377]
The workshop specifically focused on challenges and successes in government data sharing at various levels. The first day focused on successful examples of new technology applied to sharing of public data, including formal privacy techniques, synthetic data, and cryptographic approaches.
arXiv Detail & Related papers (2022-06-17T17:20:29Z)
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research [3.536605202672355]
We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions.
arXiv Detail & Related papers (2021-12-03T05:01:47Z)
SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning [63.192289553021816]
Progress toward the United Nations Sustainable Development Goals has been hindered by a lack of data on key environmental and socioeconomic indicators. Recent advances in machine learning have made it possible to utilize abundant, frequently-updated, and globally available data, such as from satellites or social media. In this paper, we introduce SustainBench, a collection of 15 benchmark tasks across 7 SDGs.
arXiv Detail & Related papers (2021-11-08T18:59:04Z)
Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning. We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.