Language Model-Driven Data Pruning Enables Efficient Active Learning
- URL: http://arxiv.org/abs/2410.04275v1
- Date: Sat, 5 Oct 2024 19:46:11 GMT
- Title: Language Model-Driven Data Pruning Enables Efficient Active Learning
- Authors: Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
- Abstract summary: We introduce ActivePrune, a plug-and-play strategy that leverages language models to prune the unlabeled pool.
To enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method.
Experiments on translation, sentiment analysis, topic classification, and summarization tasks demonstrate that ActivePrune outperforms existing data pruning methods.
- Score: 6.816044132563518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies suitable instances for labeling from the unlabeled pool. However, these acquisition methods suffer from high computational costs on large unlabeled data pools, posing a roadblock to their applicability on large datasets. To address this challenge, we introduce a novel plug-and-play unlabeled data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data-quality metrics computed with a quantized LLM. Additionally, to enhance the diversity of the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks, spanning four diverse datasets and four active learning strategies, demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the selection quality $\leftrightarrow$ efficiency tradeoff of the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, and provides up to a 74% reduction in the end-to-end time required for active learning.
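The abstract outlines a concrete pipeline shape: a cheap n-gram perplexity pass over the whole pool, a more expensive quality pass with a quantized LLM over the survivors, and a perplexity-reweighting step that promotes underrepresented instances across iterations. The Python sketch below illustrates that shape only; the scorer functions, keep fractions, and the specific reweighting rule are placeholder assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-stage pruning loop in the spirit of the abstract.
# The scorers and the reweighting rule are placeholders, not the paper's code.
import random
from typing import Callable, List, Set

def two_stage_prune(
    pool: List[str],
    ngram_perplexity: Callable[[str], float],   # cheap n-gram LM scorer (assumed interface)
    llm_quality_score: Callable[[str], float],  # quantized-LLM quality metric (assumed interface)
    stage1_keep: float = 0.5,                   # fraction surviving the fast perplexity pass
    stage2_keep: float = 0.2,                   # fraction surviving the quality pass
    previously_selected: Set[str] = frozenset(),
) -> List[str]:
    """Return a pruned unlabeled pool for the next active-learning iteration."""

    # Stage 1: rank the whole pool by reweighted perplexity. The reweighting
    # here is an illustrative stand-in for the paper's perplexity reweighting:
    # instances not chosen in earlier iterations get a boost so that
    # underrepresented examples move forward in later iterations.
    def reweighted(text: str) -> float:
        boost = 1.0 if text in previously_selected else 0.5
        return ngram_perplexity(text) * boost

    ranked = sorted(pool, key=reweighted)  # lower reweighted perplexity is kept first
    survivors = ranked[: max(1, int(stage1_keep * len(ranked)))]

    # Stage 2: run the expensive quality metric only on the survivors and
    # keep the highest-scoring fraction for the acquisition function.
    by_quality = sorted(survivors, key=llm_quality_score, reverse=True)
    return by_quality[: max(1, int(stage2_keep * len(by_quality)))]

if __name__ == "__main__":
    random.seed(0)
    pool = [f"unlabeled sentence {i}" for i in range(100)]
    pruned = two_stage_prune(
        pool,
        ngram_perplexity=lambda s: random.uniform(10.0, 500.0),  # stand-in scorer
        llm_quality_score=lambda s: random.random(),             # stand-in scorer
    )
    print(f"kept {len(pruned)} of {len(pool)} instances for the acquisition step")
```

The pruned pool returned here would then be handed to whatever acquisition function the active-learning loop already uses, which is where the end-to-end time savings reported in the abstract would come from.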
Related papers
- Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation [21.20806568508201]
We show how to leverage class text information to mitigate distribution drifts encountered by vision-language models (VLMs) during test-time inference.
We propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem.
Experiments on multiple popular test-time adaptation benchmarks of diverse complexity empirically show the superiority of CLIP-OT.
arXiv Detail & Related papers (2024-11-26T00:15:37Z) - Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models [3.546617486894182]
We introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks.
Results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets.
arXiv Detail & Related papers (2024-06-13T15:06:11Z) - Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of the in-context exemplars included in the prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
arXiv Detail & Related papers (2024-05-25T08:23:05Z) - Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that repeatedly alternates between time-consuming model training and batch data selection.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z) - Deep Active Learning with Contrastive Learning Under Realistic Data Pool Assumptions [2.578242050187029]
Active learning aims to identify the most informative data from an unlabeled data pool that enables a model to reach the desired accuracy rapidly.
Most existing active learning methods have been evaluated in an ideal setting where only samples relevant to the target task exist in an unlabeled data pool.
We introduce new active learning benchmarks that include ambiguous, task-irrelevant out-of-distribution samples as well as in-distribution samples.
arXiv Detail & Related papers (2023-03-25T10:46:10Z) - Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z) - Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme that selects optimal subsets of unlabeled samples with a fixed batch size from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z) - Low Budget Active Learning via Wasserstein Distance: An Integer Programming Approach [81.19737119343438]
Active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label.
We propose a new integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool.
Our strategy requires high-quality latent features, which we obtain by unsupervised learning on the unlabeled pool (a toy sketch of this core-set idea appears after this list).
arXiv Detail & Related papers (2021-06-05T21:25:03Z) - Data Shapley Valuation for Efficient Batch Active Learning [21.76249748709411]
Active Data Shapley (ADS) is a filtering layer for batch active learning.
We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats.
arXiv Detail & Related papers (2021-04-16T18:53:42Z) - Semi-supervised Batch Active Learning via Bilevel Optimization [89.37476066973336]
We formulate our approach as a data summarization problem via bilevel optimization.
We show that our method is highly effective on keyword detection tasks in the regime where only a few labeled samples are available.
arXiv Detail & Related papers (2020-10-19T16:53:24Z)
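As a side note on the "Low Budget Active Learning via Wasserstein Distance" entry above: if the core-set points may carry arbitrary weights, minimizing the discrete 1-Wasserstein distance to the uniformly weighted pool reduces to a k-median style assignment problem. The toy integer program below, written with the PuLP solver over random 2-D features, is only an illustrative sketch of that reduction, not the paper's formulation or code.

```python
# Toy k-median integer program: choose k core-set points minimizing the total
# distance from each pool point to its assigned core-set point. With freely
# chosen core-set weights this matches minimizing the discrete 1-Wasserstein
# distance to the uniformly weighted pool. Requires `pip install pulp`.
import math
import random

import pulp

random.seed(0)
n, k = 12, 3
pool = [(random.random(), random.random()) for _ in range(n)]  # stand-in 2-D latent features

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

d = [[dist(pool[i], pool[j]) for j in range(n)] for i in range(n)]

prob = pulp.LpProblem("wasserstein_coreset", pulp.LpMinimize)
y = pulp.LpVariable.dicts("select", range(n), cat="Binary")              # y[j] = 1: j is in the core set
z = pulp.LpVariable.dicts("assign", (range(n), range(n)), cat="Binary")  # z[i][j] = 1: i is assigned to j

# Objective: total assignment cost (proportional to the transport cost).
prob += pulp.lpSum(d[i][j] * z[i][j] for i in range(n) for j in range(n))

for i in range(n):
    prob += pulp.lpSum(z[i][j] for j in range(n)) == 1  # every pool point is assigned exactly once
    for j in range(n):
        prob += z[i][j] <= y[j]                         # and only to a selected core-set point
prob += pulp.lpSum(y[j] for j in range(n)) == k         # labeling (core-set) budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
core_set = [j for j in range(n) if y[j].value() > 0.5]
print("selected core-set indices:", core_set)
```

The entry above notes that the approach relies on high-quality latent features obtained by unsupervised learning; in this sketch, the random 2-D points merely stand in for those features, and the quadratic number of assignment variables means the exact formulation is only practical for small pools or with further reductions.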
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.