Data Collection and Labeling Techniques for Machine Learning
- URL: http://arxiv.org/abs/2407.12793v1
- Date: Wed, 19 Jun 2024 06:01:28 GMT
- Title: Data Collection and Labeling Techniques for Machine Learning
- Authors: Qianyu Huang, Tongfang Zhao
- Abstract summary: Data collection and labeling are critical bottlenecks in the deployment of machine learning applications.
This paper provides a review of the state-of-the-art methods in data collection, data labeling, and the improvement of existing data and models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data collection and labeling are critical bottlenecks in the deployment of machine learning applications. With the increasing complexity and diversity of applications, the need for efficient and scalable data collection and labeling techniques has become paramount. This paper provides a review of the state-of-the-art methods in data collection, data labeling, and the improvement of existing data and models. By integrating perspectives from both the machine learning and data management communities, we aim to provide a holistic view of the current landscape and identify future research directions.
Related papers
- Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices.
We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
- AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- Towards Data-centric Graph Machine Learning: Review and Outlook [120.64417630324378]
We introduce a systematic framework, Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of the graph data lifecycle.
A thorough taxonomy of each stage is presented to answer three critical graph-centric questions.
We pinpoint the future prospects of the DC-GML domain, providing insights to navigate its advancements and applications.
arXiv Detail & Related papers (2023-09-20T00:40:13Z)
- Designing Data: Proactive Data Collection and Iteration for Machine Learning [12.295169687537395]
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications.
New methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real-world variability.
arXiv Detail & Related papers (2023-01-24T21:40:29Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Monitoring and explainability of models in production [58.720142291102135]
Monitoring deployed models is crucial for the continued provision of high-quality machine-learning-enabled services.
We discuss the challenges to successful implementation of solutions in each of these areas, with some recent examples of production-ready solutions using open source tools.
arXiv Detail & Related papers (2020-07-13T10:37:05Z)