Similarity Search for Efficient Active Learning and Search of Rare
Concepts
- URL: http://arxiv.org/abs/2007.00077v2
- Date: Thu, 22 Jul 2021 16:54:12 GMT
- Title: Similarity Search for Efficient Active Learning and Search of Rare
Concepts
- Authors: Cody Coleman, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter
Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, I.
Zeki Yalniz
- Abstract summary: We improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set.
Our approach achieved similar mean average precision and recall as the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, thus enabling web-scale active learning.
- Score: 78.5475382904847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many active learning and search approaches are intractable for large-scale
industrial settings with billions of unlabeled examples. Existing approaches
search globally for the optimal examples to label, scaling linearly or even
quadratically with the unlabeled data. In this paper, we improve the
computational efficiency of active learning and search methods by restricting
the candidate pool for labeling to the nearest neighbors of the currently
labeled set instead of scanning over all of the unlabeled data. We evaluate
several selection strategies in this setting on three large-scale computer
vision datasets: ImageNet, OpenImages, and a de-identified and aggregated
dataset of 10 billion images provided by a large internet company. Our approach
achieved similar mean average precision and recall as the traditional global
approach while reducing the computational cost of selection by up to three
orders of magnitude, thus enabling web-scale active learning.
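A minimal Python sketch of this restricted-pool selection, not the paper's exact implementation: it assumes float32 embeddings, a Faiss flat index (exact search for clarity; web-scale use would pair this with an approximate index), and a classifier exposing a predict_proba function; k and budget are illustrative values rather than the paper's settings.

    import numpy as np
    import faiss

    def seals_round(embeddings, labeled_idx, predict_proba, k=100, budget=10):
        """Pick `budget` examples, scoring only neighbors of the labeled set."""
        x = np.ascontiguousarray(embeddings, dtype="float32")
        index = faiss.IndexFlatL2(x.shape[1])  # exact index, for illustration
        index.add(x)

        # Candidate pool: union of the k nearest neighbors of each labeled
        # example, minus the already-labeled examples themselves.
        _, nbrs = index.search(x[labeled_idx], k)
        candidates = np.setdiff1d(np.unique(nbrs), labeled_idx)

        # Score only the restricted pool. Max-entropy uncertainty sampling is
        # shown; other selection strategies plug in at this step.
        probs = predict_proba(x[candidates])
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return candidates[np.argsort(-scores)[:budget]]

In each round, the returned indices would be labeled, appended to labeled_idx, and the model retrained, so selection cost tracks the size of the labeled set's neighborhood rather than the full unlabeled pool.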
Related papers
- Learning from the Best: Active Learning for Wireless Communications [9.523381807291049]
Active learning algorithms identify the most critical and informative samples in an unlabeled dataset and label only those samples, instead of the complete set.
We present a case study of deep learning-based mmWave beam selection, where labeling is performed by a compute-intensive algorithm based on exhaustive search.
Our results show that using an active learning algorithm for class-imbalanced datasets can reduce labeling overhead by up to 50% for this dataset.
arXiv Detail & Related papers (2024-01-23T12:21:57Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that plays a key role in identifying smaller, representative portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling [20.982992381790034]
We propose a post-hoc active learning algorithm that integrates uncertainty-based sampling with diversity-based sampling.
Our proposed algorithm is not only simple and easy to implement but also delivers superior performance on various datasets; a minimal sketch of this two-step pattern appears after this list.
arXiv Detail & Related papers (2023-09-28T03:40:30Z)
- Cold PAWS: Unsupervised class discovery and addressing the cold-start problem for semi-supervised learning [0.30458514384586394]
We propose a novel approach based on well-established self-supervised learning, clustering, and manifold learning techniques.
We test our approach using several publicly available datasets, namely CIFAR10, Imagenette, DeepWeeds, and EuroSAT.
We obtain superior performance on the datasets considered with a much simpler approach than other methods in the literature.
arXiv Detail & Related papers (2023-05-17T09:17:59Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets with both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
- Budget-aware Few-shot Learning via Graph Convolutional Network [56.41899553037247]
This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples.
A common problem setting in few-shot classification assumes a random sampling strategy for acquiring data labels.
We introduce a new budget-aware few-shot learning problem that aims to learn novel object categories.
arXiv Detail & Related papers (2022-01-07T02:46:35Z)
- Big Self-Supervised Models are Strong Semi-Supervised Learners [116.00752519907725]
We show that unsupervised pretraining followed by supervised fine-tuning is surprisingly effective for semi-supervised learning on ImageNet.
A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning.
We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network.
arXiv Detail & Related papers (2020-06-17T17:48:22Z)
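Several entries above combine uncertainty with diversity. Below is a minimal Python sketch of the two-step pattern from "Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling": shortlist by uncertainty, then pick a spread-out subset. The greedy farthest-point step stands in for the paper's diversity sampling, whose exact form may differ; features and per-example uncertainty scores are assumed precomputed.

    import numpy as np

    def two_step_select(features, uncertainty, shortlist=500, budget=50):
        # Step 1: keep the most uncertain examples (budget <= shortlist).
        top = np.argsort(-uncertainty)[:shortlist]
        cand = features[top]

        # Step 2: greedy farthest-point traversal over the shortlist,
        # starting from the single most uncertain example.
        chosen = [0]
        dists = np.linalg.norm(cand - cand[0], axis=1)
        for _ in range(budget - 1):
            nxt = int(np.argmax(dists))  # farthest from everything chosen
            chosen.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(cand - cand[nxt], axis=1))
        return top[np.array(chosen)]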