Towards Free Data Selection with General-Purpose Models
- URL: http://arxiv.org/abs/2309.17342v2
- Date: Sat, 14 Oct 2023 22:43:50 GMT
- Title: Towards Free Data Selection with General-Purpose Models
- Authors: Yichen Xie, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan
- Abstract summary: A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that repeatedly alternates between time-consuming model training and batch data selection.
FreeSel bypasses the heavy batch selection process, achieving a significant efficiency improvement: it is 530x faster than existing active learning methods.
- Score: 71.92151210413374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A desirable data selection algorithm can efficiently choose the most
informative samples to maximize the utility of limited annotation budgets.
However, current approaches, represented by active learning methods, typically
follow a cumbersome pipeline that repeatedly alternates between time-consuming
model training and batch data selection. In this paper, we challenge this status
quo by designing a distinct data selection pipeline that utilizes existing
general-purpose models to select data from various datasets with a single-pass
inference without the need for additional training or supervision. A novel free
data selection (FreeSel) method is proposed following this new pipeline.
Specifically, we define semantic patterns extracted from intermediate features
of the general-purpose model to capture subtle local information in each image.
We then enable the selection of all data samples in a single pass through
distance-based sampling at the fine-grained semantic pattern level. FreeSel
bypasses the heavy batch selection process, achieving a significant efficiency
improvement: it is 530x faster than existing active learning methods.
Extensive experiments verify the effectiveness of FreeSel on various computer
vision tasks. Our code is available at https://github.com/yichen928/FreeSel.
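The pipeline lends itself to a compact illustration. The sketch below is a hypothetical reconstruction, not the authors' implementation (see the linked repository for that): it clusters each image's intermediate features from a frozen general-purpose model into a few "semantic patterns", then runs greedy farthest-point (distance-based) sampling over those patterns in a single pass. All function names and parameters (`extract_patterns`, `k`, the Euclidean metric, the use of k-means) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patterns(features, k=5):
    # Cluster one image's intermediate features [n_tokens, d] from a frozen
    # general-purpose model into k local "semantic patterns" (centroids).
    # k and the choice of k-means are assumptions for illustration.
    km = KMeans(n_clusters=min(k, len(features)), n_init=10).fit(features)
    return km.cluster_centers_  # [k, d]

def pattern_distance(a, b):
    # Smallest pairwise Euclidean distance between pattern sets [k, d], [m, d].
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()

def free_select(per_image_features, budget, k=5):
    # Single-pass, training-free selection: greedy farthest-point sampling at
    # the semantic-pattern level, with no model retraining between batches.
    patterns = [extract_patterns(f, k) for f in per_image_features]
    selected = [0]  # seed with an arbitrary image
    dists = np.array([pattern_distance(p, patterns[0]) for p in patterns])
    dists[0] = -np.inf
    for _ in range(budget - 1):
        i = int(np.argmax(dists))  # image farthest from everything chosen
        selected.append(i)
        dists[i] = -np.inf
        for j in range(len(patterns)):
            if dists[j] != -np.inf:  # incremental min-distance update
                dists[j] = min(dists[j], pattern_distance(patterns[j], patterns[i]))
    return selected
```

Given per-image features from, e.g., a self-supervised vision transformer, `free_select(feats, budget=100)` would return indices of images to annotate. The real FreeSel defines its patterns and distances more carefully; this sketch only conveys why a single inference pass suffices: no labels or retraining enter the loop.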
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which hurts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently.
We derive a simple and tractable algorithm for selecting such batches, which significantly accelerates training beyond individually-prioritized data points.
arXiv Detail & Related papers (2024-06-25T16:52:37Z)
- Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods score and select data independently in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z)
- BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges [12.248397169100784]
Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training.
We introduce Best Window Selection (BWS), a universal and efficient data subset selection method that chooses the best window subset from samples ordered by their difficulty scores.
arXiv Detail & Related papers (2024-06-05T08:33:09Z)
- AdaSelection: Accelerating Deep Learning Training through Data Subsampling [27.46630703428186]
We introduce AdaSelection, an adaptive sub-sampling method to identify the most informative sub-samples within each minibatch.
Compared with industry-standard baselines, AdaSelection consistently displays superior performance.
arXiv Detail & Related papers (2023-06-19T07:01:28Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
The proposed algorithm is more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets.
We propose a novel general and efficient active learning (GEAL) method in this paper.
Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
- Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)
- On Deep Unsupervised Active Learning [41.579343330613675]
Unsupervised active learning aims to select representative samples in an unsupervised setting for human annotation.
In this paper, we present a novel deep neural network framework for unsupervised active learning.
arXiv Detail & Related papers (2020-07-28T02:52:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.