Large Language Models are Demonstration Pre-Selectors for Themselves
- URL: http://arxiv.org/abs/2506.06033v1
- Date: Fri, 06 Jun 2025 12:29:03 GMT
- Title: Large Language Models are Demonstration Pre-Selectors for Themselves
- Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang
- Abstract summary: In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing demonstrations from the entire training data. FEEDER (FEw yet Essential Demonstration prE-selectoR) is a novel pre-selection framework that identifies a representative subset of demonstrations. FEEDER can reduce training data size by over 20% while maintaining performance.
- Score: 57.101804269100185
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by selecting demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a subset containing the most representative demonstrations in the training data, tailored to specific LLMs. To construct this subset, we introduce "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this subset can effectively replace the full training data, improving efficiency while maintaining comparable ICL performance. Additionally, the pre-selected subset also benefits LLM fine-tuning, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.
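The abstract names the "sufficiency" and "necessity" metrics without defining them, so the following is only a minimal illustrative sketch, not the paper's algorithm: a greedy loop that keeps a demonstration if it either solves some probe query on its own (a sufficiency proxy) or its removal breaks a probe that the current subset solves (a necessity proxy). The `llm_answers_correctly` callable and the linear scan standing in for the paper's tree-based search are both assumptions.

```python
# Illustrative sketch of sufficiency/necessity-style pre-selection.
# `llm_answers_correctly(demos, query)` is a hypothetical scorer: it should
# return True when the LLM answers `query` correctly given `demos` in context.
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, label)

def preselect(
    pool: List[Demo],
    probes: List[Demo],
    llm_answers_correctly: Callable[[List[Demo], Demo], bool],
) -> List[Demo]:
    """Greedily drop demos that are neither sufficient nor necessary.

    A linear scan stands in for the paper's tree-based search; it is
    kept simple for illustration, not efficiency.
    """
    kept = list(pool)
    for demo in pool:
        # Sufficiency proxy: does this demo alone solve any probe?
        sufficient = any(llm_answers_correctly([demo], p) for p in probes)
        # Necessity proxy: does removing it break a probe that the
        # current subset still solves?
        without = [d for d in kept if d is not demo]
        necessary = any(
            llm_answers_correctly(kept, p)
            and not llm_answers_correctly(without, p)
            for p in probes
        )
        if not (sufficient or necessary):
            kept = without
    return kept
```

Once `kept` is computed, downstream demonstration-selection strategies would retrieve from it instead of from the full training set.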
Related papers
- Towards Compute-Optimal Many-Shot In-Context Learning [63.815463719071055]
We propose two strategies for demonstration selection in many-shot ICL. The first combines a small number of demonstrations, selected by similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second improves on the first by replacing the random demonstrations with ones selected using centroids derived from test-sample representations via k-means clustering (see the sketch after this list).
arXiv Detail & Related papers (2025-07-22T04:21:03Z)
- Leveraging In-Context Learning for Language Model Agents [51.2996117207114]
In-context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance. We show that set-selection of trajectories of similar tasks as demonstrations significantly improves the performance, reliability, robustness, and efficiency of LLM agents. We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents.
arXiv Detail & Related papers (2025-06-16T05:37:49Z)
- Learning to Select In-Context Demonstration Preferred by Large Language Model [21.077656767563255]
In-context learning (ICL) enables large language models to adapt to new tasks during inference using only a few demonstrations. We propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories show that GenICL outperforms existing methods at selecting the most effective demonstrations.
arXiv Detail & Related papers (2025-05-26T13:26:56Z)
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [41.4789135538612]
This paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. Thanks to the advanced language understanding capabilities of large language models (LLMs), we utilize LLMs to evaluate the value of each option during the selection process.
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [28.442470930703337]
PRISM is a training-free approach for efficient multimodal data selection. It uses Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs. It reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods.
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
- Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration [39.16321257800402]
We propose a multi-actor collaborative data selection mechanism to accelerate the pretraining of language models (LMs). Each data selection method independently prioritizes data based on its criterion and updates its prioritization rules using the current state of the model. A console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process.
arXiv Detail & Related papers (2024-10-10T16:45:28Z)
- Large Language Models Know What Makes Exemplary Contexts [42.90814615222177]
In-context learning (ICL) has proven to be a significant capability that has emerged with the advancement of large language models (LLMs).
This paper presents a unified framework for LLMs that allows them to self-select influential in-context examples to compose their contexts.
arXiv Detail & Related papers (2024-08-14T12:32:41Z)
- ParaICL: Towards Parallel In-Context Learning [74.38022919598443]
Large language models (LLMs) have become the norm in natural language processing. Few-shot in-context learning (ICL) relies on the choice of few-shot demonstration examples. We propose a novel method named parallel in-context learning (ParaICL).
arXiv Detail & Related papers (2024-03-31T05:56:15Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
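As referenced above, here is a rough sketch of the centroid-based caching strategy from "Towards Compute-Optimal Many-Shot In-Context Learning": cluster test-sample embeddings with k-means and cache the training demonstration nearest to each centroid. The embedding arrays and the nearest-neighbor rule are assumptions made for illustration; the paper's exact procedure may differ.

```python
# Sketch: pick k cached demonstrations via k-means over test-sample embeddings.
import numpy as np
from sklearn.cluster import KMeans

def select_cached_demos(test_embs: np.ndarray,
                        demo_embs: np.ndarray,
                        k: int) -> list[int]:
    """Return indices of k demonstrations, one per test-embedding centroid."""
    centroids = (
        KMeans(n_clusters=k, n_init=10, random_state=0)
        .fit(test_embs)
        .cluster_centers_
    )
    # For each centroid, take the nearest demonstration by Euclidean distance.
    dists = np.linalg.norm(demo_embs[None, :, :] - centroids[:, None, :], axis=-1)
    return np.argmin(dists, axis=1).tolist()
```

In the paper's first strategy, these cached demonstrations are combined with a handful of per-query similar examples; only the caching step is sketched here.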