The Lean Data Scientist: Recent Advances towards Overcoming the Data Bottleneck
- URL: http://arxiv.org/abs/2211.07959v1
- Date: Tue, 15 Nov 2022 07:44:56 GMT
- Authors: Chen Shani, Jonathan Zarecki, Dafna Shahaf
- Abstract summary: Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry.
Recent algorithms are increasingly data-hungry, requiring large datasets for training.
However, obtaining quality datasets of such magnitude proves to be a difficult challenge.
- Score: 16.18460753647167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) is revolutionizing the world, affecting almost every
field of science and industry. Recent algorithms (in particular, deep networks)
are increasingly data-hungry, requiring large datasets for training. Thus, the
dominant paradigm in ML today involves constructing large, task-specific
datasets.
However, obtaining quality datasets of such magnitude proves to be a
difficult challenge. A variety of methods have been proposed to address this
data bottleneck problem, but they are scattered across different areas, and it
is hard for a practitioner to keep up with the latest developments. In this
work, we propose a taxonomy of these methods. Our goal is twofold: (1) We wish
to raise the community's awareness of the methods that already exist and
encourage more efficient use of resources, and (2) we hope that such a taxonomy
will contribute to our understanding of the problem, inspiring novel ideas and
strategies to replace current annotation-heavy approaches.
Related papers
- Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z)
- Data Optimization in Deep Learning: A Survey [3.1274367448459253]
This study aims to organize a wide range of existing data optimization methodologies for deep learning.
The constructed taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension.
The constructed taxonomy and the revealed connections are intended to improve understanding of existing methods and to inform the design of novel data optimization techniques.
arXiv Detail & Related papers (2023-10-25T09:33:57Z) - A Survey of Label-Efficient Deep Learning for 3D Point Clouds [109.07889215814589]
This paper presents the first comprehensive survey of label-efficient learning of point clouds.
We propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels.
For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges.
arXiv Detail & Related papers (2023-05-31T12:54:51Z) - Towards Label-Efficient Incremental Learning: A Survey [42.603603392991715]
We study incremental learning, where a learner is required to adapt to an incoming stream of data with a varying distribution.
We identify three subdivisions, namely semi-, few-shot- and self-supervised learning to reduce labeling efforts.
arXiv Detail & Related papers (2023-02-01T10:24:55Z) - Advanced Data Augmentation Approaches: A Comprehensive Survey and Future
directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results on the effect of data augmentation on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z) - Data Augmentation techniques in time series domain: A survey and
taxonomy [0.20971479389679332]
Deep neural networks used to work with time series heavily depend on the size and consistency of the datasets used in training.
This work systematically reviews the current state-of-the-art in the area to provide an overview of all available algorithms.
The ultimate aim of this study is to provide a summary of the evolution and performance of areas that produce better results to guide future researchers in this field.
arXiv Detail & Related papers (2022-06-25T17:09:00Z) - Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial learning based method, which is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z) - Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z) - Few-shot Partial Multi-view Learning [103.33865779721458]
We propose a new task called few-shot partial multi-view learning.
It focuses on overcoming the negative impact of the view-missing issue in the low-data regime.
We conduct extensive experiments to evaluate our method.
arXiv Detail & Related papers (2021-05-05T13:34:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.