Active learning for data streams: a survey
- URL: http://arxiv.org/abs/2302.08893v4
- Date: Wed, 29 Nov 2023 21:07:15 GMT
- Title: Active learning for data streams: a survey
- Authors: Davide Cacciarelli, Murat Kulahci
- Abstract summary: Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream.
Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data.
This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time.
- Score: 0.48951183832371004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online active learning is a paradigm in machine learning that aims to select
the most informative data points to label from a data stream. The problem of
minimizing the cost associated with collecting labeled observations has gained
a lot of attention in recent years, particularly in real-world applications
where data is only available in an unlabeled form. Annotating each observation
can be time-consuming and costly, making it difficult to obtain large amounts
of labeled data. To overcome this issue, many active learning strategies have
been proposed in the last decades, aiming to select the most informative
observations for labeling in order to improve the performance of machine
learning models. These approaches can be broadly divided into two categories:
static pool-based and stream-based active learning. Pool-based active learning
involves selecting a subset of observations from a closed pool of unlabeled
data, and it has been the focus of many surveys and literature reviews.
However, the growing availability of data streams has led to an increase in the
number of approaches that focus on online active learning, which involves
continuously selecting and labeling observations as they arrive in a stream.
This work aims to provide an overview of the most recently proposed approaches
for selecting the most informative observations from data streams in real time.
We review the various techniques that have been proposed and discuss their
strengths and limitations, as well as the challenges and opportunities that
exist in this area of research.
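The stream-based setting described in the abstract can be sketched as a simple loop: each arriving observation is scored for informativeness, and its label is queried only when the model is uncertain and a labeling budget has not been exhausted. The following is a minimal illustrative sketch, not an algorithm from the survey itself; the margin-based uncertainty score, the fixed-budget rule, and all function names (`stream_active_learning`, `update`, `query_label`) are assumptions chosen for clarity.

```python
import numpy as np

def margin_uncertainty(proba):
    """Uncertainty score: 1 minus the margin between the two most
    probable classes. Close to 1 means the model is very unsure."""
    top2 = np.sort(proba)[-2:]
    return 1.0 - (top2[-1] - top2[-2])

def stream_active_learning(stream, predict_proba, update,
                           threshold=0.5, budget=0.2):
    """Process observations one at a time; query the annotator only
    when the model is uncertain and the running labeling rate stays
    within the budget (a fraction of observations seen so far)."""
    n_labeled, n_seen = 0, 0
    for x, query_label in stream:
        n_seen += 1
        uncertain = margin_uncertainty(predict_proba(x)) > threshold
        within_budget = n_labeled < budget * n_seen
        if uncertain and within_budget:
            update(x, query_label())  # annotator is consulted only here
            n_labeled += 1
    return n_labeled, n_seen
```

In practice `predict_proba` and `update` would wrap an incremental classifier; here they are left abstract so the selection logic stands out. More refined strategies in the surveyed literature replace the fixed threshold with adaptive or randomized variants to cope with concept drift.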
Related papers
- Granularity Matters in Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance.
We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes.
To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Survey of Label-Efficient Deep Learning for 3D Point Clouds [109.07889215814589]
This paper presents the first comprehensive survey of label-efficient learning of point clouds.
We propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels.
For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges.
arXiv Detail & Related papers (2023-05-31T12:54:51Z)
- Responsible Active Learning via Human-in-the-loop Peer Study [88.01358655203441]
We propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability.
We first introduce a human-in-the-loop teacher-student architecture to isolate unlabelled data from the task learner (teacher) on the cloud-side.
During training, the task learner instructs the lightweight active learner, which then provides feedback on the active sampling criterion.
arXiv Detail & Related papers (2022-11-24T13:18:27Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Reinforced Meta Active Learning [11.913086438671357]
We present an online stream-based meta active learning method which learns on the fly an informativeness measure directly from the data.
The method is based on reinforcement learning and combines episodic policy search and a contextual bandits approach.
We demonstrate on several real datasets that this method learns to select training samples more efficiently than existing state-of-the-art methods.
arXiv Detail & Related papers (2022-03-09T08:36:54Z)
- Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- One-Round Active Learning [13.25385227263705]
One-round active learning aims to select a subset of unlabeled data points that achieve the highest utility after being labeled.
We propose DULO, a general framework for one-round active learning based on the notion of data utility functions.
Our results demonstrate that while existing active learning approaches could succeed with multiple rounds, DULO consistently performs better in the one-round setting.
arXiv Detail & Related papers (2021-04-23T23:59:50Z)
- Data Shapley Valuation for Efficient Batch Active Learning [21.76249748709411]
Active Data Shapley (ADS) is a filtering layer for batch active learning.
We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats.
arXiv Detail & Related papers (2021-04-16T18:53:42Z)
- Active Learning: Problem Settings and Recent Developments [2.1574781022415364]
This paper explains the basic problem settings of active learning and recent research trends.
In particular, research on learning acquisition functions to select samples from the data for labeling, theoretical work on active learning algorithms, and stopping criteria for sequential data acquisition are highlighted.
arXiv Detail & Related papers (2020-12-08T05:24:06Z)
- The Emerging Trends of Multi-Label Learning [45.63795570392158]
Exabytes of data are generated daily, creating a growing need for new efforts to address the grand challenges that big data poses for multi-label learning.
There is a lack of systematic studies that focus explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data.
A comprehensive survey is therefore needed to fill this gap and delineate future research directions and new applications.
arXiv Detail & Related papers (2020-11-23T03:36:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.