Ordered Semantically Diverse Sampling for Textual Data
- URL: http://arxiv.org/abs/2503.10698v1
- Date: Wed, 12 Mar 2025 06:38:57 GMT
- Title: Ordered Semantically Diverse Sampling for Textual Data
- Authors: Ashish Tiwari, Mukul Singh, Ananya Singha, Arjun Radhakrishna,
- Abstract summary: We introduce the ordered diverse sampling problem based on a new metric that measures the diversity in an ordered list of samples.<n>We present a novel approach for generating ordered diverse samples for textual data that uses principal components on the embedding vectors.
- Score: 6.280814487955095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of diversity sampling is to select a representative subset of data in a way that maximizes information contained in the subset while keeping its cardinality small. We introduce the ordered diverse sampling problem based on a new metric that measures the diversity in an ordered list of samples. We present a novel approach for generating ordered diverse samples for textual data that uses principal components on the embedding vectors. The proposed approach is simple and compared with existing approaches using the new metric. We transform standard text classification benchmarks into benchmarks for ordered diverse sampling. Our empirical evaluation shows that prevailing approaches perform 6% to 61% worse than our method while also being more time inefficient. Ablation studies show how the parts of the new approach contribute to the overall metrics.
Related papers
- Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.<n>Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection.
arXiv Detail & Related papers (2024-08-24T14:14:32Z) - AdaSelection: Accelerating Deep Learning Training through Data
Subsampling [27.46630703428186]
We introduce AdaSelection, an adaptive sub-sampling method to identify the most informative sub-samples within each minibatch.
Compared with industry-standard baselines, AdaSelection consistently displays superior performance.
arXiv Detail & Related papers (2023-06-19T07:01:28Z) - Structured Voronoi Sampling [61.629198273926676]
In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods.
We name our gradient-based technique Structured Voronoi Sampling (SVS)
In a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.
arXiv Detail & Related papers (2023-06-05T17:32:35Z) - Efficient Failure Pattern Identification of Predictive Algorithms [15.02620042972929]
We propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm.
The results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.
arXiv Detail & Related papers (2023-06-01T14:54:42Z) - Finding Support Examples for In-Context Learning [73.90376920653507]
We propose LENS, a fiLter-thEN-Search method to tackle this challenge in two stages.
First we filter the dataset to obtain informative in-context examples individually.
Then we propose diversity-guided example search which iteratively refines and evaluates the selected example permutations.
arXiv Detail & Related papers (2023-02-27T06:32:45Z) - Leveraging Importance Weights in Subset Selection [45.54597544672441]
We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting.
Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches.
arXiv Detail & Related papers (2023-01-28T02:07:31Z) - Text sampling strategies for predicting missing bibliographic links [0.0]
The paper proposes various strategies for sampling text when performing automatic sentence classification.
We examine a number of sampling strategies that differ in context size and position.
This method of detecting missing links can be used in recommendation engines of applied intelligent information systems.
arXiv Detail & Related papers (2023-01-04T15:53:50Z) - Exploiting Diversity of Unlabeled Data for Label-Efficient
Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z) - Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise
Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z) - A Case Study on Sampling Strategies for Evaluating Neural Sequential
Item Recommendation Models [69.32128532935403]
Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity.
We re-evaluate current state-of-the-art sequential recommender models from the point of view.
We find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models.
arXiv Detail & Related papers (2021-07-27T19:06:03Z) - Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.