Similarity Search for Efficient Active Learning and Search of Rare
Concepts
- URL: http://arxiv.org/abs/2007.00077v2
- Date: Thu, 22 Jul 2021 16:54:12 GMT
- Title: Similarity Search for Efficient Active Learning and Search of Rare
Concepts
- Authors: Cody Coleman, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter
Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, I.
Zeki Yalniz
- Abstract summary: We improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set.
Our approach achieved similar mean average precision and recall as the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, thus enabling web-scale active learning.
- Score: 78.5475382904847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many active learning and search approaches are intractable for large-scale
industrial settings with billions of unlabeled examples. Existing approaches
search globally for the optimal examples to label, scaling linearly or even
quadratically with the unlabeled data. In this paper, we improve the
computational efficiency of active learning and search methods by restricting
the candidate pool for labeling to the nearest neighbors of the currently
labeled set instead of scanning over all of the unlabeled data. We evaluate
several selection strategies in this setting on three large-scale computer
vision datasets: ImageNet, OpenImages, and a de-identified and aggregated
dataset of 10 billion images provided by a large internet company. Our approach
achieved similar mean average precision and recall as the traditional global
approach while reducing the computational cost of selection by up to three
orders of magnitude, thus enabling web-scale active learning.
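A minimal Python sketch of this restricted-pool selection, not the paper's exact implementation: it assumes float32 embeddings, a Faiss flat index (exact search for clarity; web-scale use would pair this with an approximate index), and a classifier exposing a predict_proba function; k and budget are illustrative values rather than the paper's settings.

    import numpy as np
    import faiss

    def seals_round(embeddings, labeled_idx, predict_proba, k=100, budget=10):
        """Pick `budget` examples, scoring only neighbors of the labeled set."""
        x = np.ascontiguousarray(embeddings, dtype="float32")
        index = faiss.IndexFlatL2(x.shape[1])  # exact index, for illustration
        index.add(x)

        # Candidate pool: union of the k nearest neighbors of each labeled
        # example, minus the already-labeled examples themselves.
        _, nbrs = index.search(x[labeled_idx], k)
        candidates = np.setdiff1d(np.unique(nbrs), labeled_idx)

        # Score only the restricted pool. Max-entropy uncertainty sampling is
        # shown; other selection strategies plug in at this step.
        probs = predict_proba(x[candidates])
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return candidates[np.argsort(-scores)[:budget]]

In each round, the returned indices would be labeled, appended to labeled_idx, and the model retrained, so selection cost tracks the size of the labeled set's neighborhood rather than the full unlabeled pool.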
Related papers
- Learning from the Best: Active Learning for Wireless Communications [9.523381807291049]
Active learning algorithms identify the most critical and informative samples in an unlabeled dataset and label only those samples, instead of the complete set.
We present a case study of deep learning-based mmWave beam selection, where labeling is performed by a compute-intensive algorithm based on exhaustive search.
Our results show that using an active learning algorithm for class-imbalanced datasets can reduce labeling overhead by up to 50% for this dataset.
arXiv Detail & Related papers (2024-01-23T12:21:57Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that plays a key role in identifying smaller, representative portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling [20.982992381790034]
We propose a post-hoc active learning algorithm that integrates uncertainty-based sampling with diversity-based sampling.
Our proposed algorithm is not only simple and easy to implement but also delivers superior performance on various datasets; a minimal sketch of this two-step pattern appears after this list.
arXiv Detail & Related papers (2023-09-28T03:40:30Z)
- Cold PAWS: Unsupervised class discovery and addressing the cold-start problem for semi-supervised learning [0.30458514384586394]
We propose a novel approach based on well-established self-supervised learning, clustering, and manifold learning techniques.
We test our approach using several publicly available datasets, namely CIFAR10, Imagenette, DeepWeeds, and EuroSAT.
We obtain superior performance on the datasets considered with a much simpler approach than other methods in the literature.
arXiv Detail & Related papers (2023-05-17T09:17:59Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
- Budget-aware Few-shot Learning via Graph Convolutional Network [56.41899553037247]
This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples.
A common problem setting in few-shot classification assumes a random sampling strategy for acquiring data labels.
We introduce a new budget-aware few-shot learning problem that aims to learn novel object categories.
arXiv Detail & Related papers (2022-01-07T02:46:35Z)
- Big Self-Supervised Models are Strong Semi-Supervised Learners [116.00752519907725]
We show that unsupervised pretraining followed by supervised fine-tuning is surprisingly effective for semi-supervised learning on ImageNet.
A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning.
We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network.
arXiv Detail & Related papers (2020-06-17T17:48:22Z)
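Several entries above combine uncertainty with diversity. Below is a minimal Python sketch of the two-step pattern from "Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling": shortlist by uncertainty, then pick a spread-out subset. The greedy farthest-point step stands in for the paper's diversity sampling, whose exact form may differ; features and per-example uncertainty scores are assumed precomputed.

    import numpy as np

    def two_step_select(features, uncertainty, shortlist=500, budget=50):
        # Step 1: keep the most uncertain examples (budget <= shortlist).
        top = np.argsort(-uncertainty)[:shortlist]
        cand = features[top]

        # Step 2: greedy farthest-point traversal over the shortlist,
        # starting from the single most uncertain example.
        chosen = [0]
        dists = np.linalg.norm(cand - cand[0], axis=1)
        for _ in range(budget - 1):
            nxt = int(np.argmax(dists))  # farthest from everything chosen
            chosen.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(cand - cand[nxt], axis=1))
        return top[np.array(chosen)]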