Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification
- URL: http://arxiv.org/abs/2410.10756v1
- Date: Mon, 14 Oct 2024 17:30:08 GMT
- Title: Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification
- Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky
- Abstract summary: Generative large language models (LLMs) are increasingly used for data augmentation tasks.
We compare sample selection strategies existing in few-shot learning literature and investigate their effects in LLM-based textual augmentation.
Results indicate that while some ``informed'' selection strategies increase the performance of models, this happens only seldom and with marginal performance increases.
- Score: 6.273933281069326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly, and a comprehensive overview of the effects of other (more ``informed'') sample selection strategies is lacking. In this work, we compare sample selection strategies existing in the few-shot learning literature and investigate their effects in LLM-based textual augmentation. We evaluate this on in-distribution and out-of-distribution classifier performance. Results indicate that while some ``informed'' selection strategies increase the performance of models, especially for out-of-distribution data, this happens only seldom and with marginal performance increases. Unless further advances are made, a default of random sample selection remains a good option for augmentation practitioners.
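To make the augmentation setup concrete, here is a minimal sketch of the default random selection strategy: pick k exemplars uniformly at random and place them into a paraphrasing prompt for the LLM. Function names, prompt wording, and the toy data are illustrative assumptions, not taken from the paper.

```python
import random

def select_few_shot(pool, k=5, seed=0):
    """Default 'uninformed' strategy: sample k exemplars uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

def build_paraphrase_prompt(exemplars, seed_text):
    """Assemble a simple paraphrasing prompt around the selected exemplars."""
    shots = "\n".join(f"- {s}" for s in exemplars)
    return (
        "Here are example texts from the target class:\n"
        f"{shots}\n\n"
        "Paraphrase the following text in a similar style:\n"
        f"{seed_text}"
    )

# The LLM call itself is omitted; the generated paraphrases would then be
# added to the training set used for classifier fine-tuning.
pool = ["great battery life", "camera quality is superb", "fast shipping, well packed"]
print(build_paraphrase_prompt(select_few_shot(pool, k=2), "the screen is bright and sharp"))
```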
Related papers
- Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection [3.9620215314408984]
Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks.
We observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings.
By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion $\alpha$-Maximum Mean Discrepancy ($\alpha$-MMD), RDSS samples a representative subset for annotation from the unlabeled data.
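As a rough illustration of representativeness-driven selection, the sketch below greedily picks a subset whose MMD to the full unlabeled pool is small under an RBF kernel. This is a simplified greedy stand-in for the paper's Frank-Wolfe optimisation of $\alpha$-MMD; the kernel choice, bandwidth, and function names are assumptions.

```python
import numpy as np

def greedy_mmd_select(X_unlabeled, budget, gamma=1.0):
    """Greedily pick a subset whose (biased) MMD^2 to the full pool is small."""
    n = len(X_unlabeled)
    # Precompute the full RBF kernel matrix.
    sq = np.sum(X_unlabeled ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X_unlabeled @ X_unlabeled.T))
    mean_k = K.mean(axis=1)  # average similarity of each point to the whole pool
    selected = []
    for _ in range(budget):
        best, best_score = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            cand = selected + [i]
            # Biased MMD^2 between the candidate subset and the full pool;
            # the pool-pool term is constant and therefore dropped.
            score = K[np.ix_(cand, cand)].mean() - 2 * mean_k[cand].mean()
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```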
arXiv Detail & Related papers (2024-09-18T02:40:31Z)
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning [30.536184852029386]
Large language models (LLMs) possess the capability to engage in in-context learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions.
However, this particular learning paradigm suffers from high instability stemming from substantial variances induced by factors such as the input distribution of selected examples, their ordering, and prompt formats.
arXiv Detail & Related papers (2023-10-13T07:49:11Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- How to choose "Good" Samples for Text Data Augmentation [4.60495447017298]
We propose a novel self-training framework with two selectors to select high-quality samples from data augmentation.
Specifically, we first use an entropy-based strategy and the model prediction to select augmented samples.
Considering that some high-quality samples may be wrongly filtered out in the above step, we propose to recall them from the two perspectives of word overlap and semantic similarity.
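A hedged sketch of such an entropy-plus-prediction filter is shown below; the threshold, the acceptance rule, and all names are assumptions, and the word-overlap / semantic-similarity recall step is omitted.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a predictive distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_by_entropy(aug_texts, aug_labels, predict_proba, max_entropy=0.5):
    """Keep augmented samples that the current model classifies confidently
    (low entropy) and consistently with their intended label."""
    kept = []
    for text, label in zip(aug_texts, aug_labels):
        probs = predict_proba(text)  # e.g. softmax output of the classifier
        if entropy(probs) <= max_entropy and int(np.argmax(probs)) == label:
            kept.append((text, label))
    return kept
```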
arXiv Detail & Related papers (2023-02-02T06:01:50Z)
- ReSmooth: Detecting and Utilizing OOD Samples when Training with Data Augmentation [57.38418881020046]
Recent data augmentation (DA) techniques pursue diversity in augmented training samples.
An augmentation strategy that has a high diversity usually introduces out-of-distribution (OOD) augmented samples.
We propose ReSmooth, a framework that first detects OOD samples among the augmented samples and then leverages them.
arXiv Detail & Related papers (2022-05-25T09:29:27Z)
- SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of randomness in selecting samples to augment are alleviated and the effectiveness of DA is improved.
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
- On Training Instance Selection for Few-Shot Neural Text Generation [9.37935464602938]
We present a study on training instance selection in few-shot neural text generation.
We propose a simple selection strategy with K-means clustering.
We show that this strategy consistently outperforms random sampling on three text generation tasks.
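A minimal sketch of such a K-means-based pick is given below; using precomputed instance embeddings as input and choosing the instance closest to each centroid are assumptions about the details, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_select(embeddings, k=8, seed=0):
    """Cluster instance embeddings and return the index of the instance
    closest to each cluster centroid."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return chosen
```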
arXiv Detail & Related papers (2021-07-07T12:16:16Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- True Few-Shot Learning with Language Models [78.42578316883271]
We evaluate the few-shot ability of LMs when held-out examples are unavailable.
Our findings suggest that prior work significantly overestimated the true few-shot ability of LMs.
arXiv Detail & Related papers (2021-05-24T17:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.