VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
- URL: http://arxiv.org/abs/2512.12072v1
- Date: Fri, 12 Dec 2025 22:39:01 GMT
- Title: VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
- Authors: Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi,
- Abstract summary: We propose Voyager, a principled approach to generate diverse datasets.<n>Our approach is iterative and directly optimize a mathematical quantity.<n>Our approach is training-free, applicable to closed-source models, and scalable.
- Score: 12.406675425503702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that optimizes the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.
Related papers
- Federated Online Learning for Heterogeneous Multisource Streaming Data [0.0]
Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns.<n>In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis.
arXiv Detail & Related papers (2025-08-08T19:08:53Z) - Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning [49.10890099624699]
We introduce a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency.<n>Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
arXiv Detail & Related papers (2025-07-17T03:08:26Z) - Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data [54.3895971080712]
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains.<n>We propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data.
arXiv Detail & Related papers (2025-02-05T17:21:01Z) - CorrSynth -- A Correlated Sampling Method for Diverse Dataset Generation from LLMs [5.89889361990138]
Large language models (LLMs) have demonstrated remarkable performance in diverse tasks using zero-shot and few-shot prompting.
In this work, we tackle the challenge of generating datasets with high diversity, upon which a student model is trained for downstream tasks.
Taking the route of decoding-time guidance-based approaches, we propose Corr Synth, which generates data that is more diverse and faithful to the input prompt using a correlated sampling strategy.
arXiv Detail & Related papers (2024-11-13T12:09:23Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View
Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs)
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - Learning Distributionally Robust Models at Scale via Composite
Optimization [45.47760229170775]
We show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods.
We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
arXiv Detail & Related papers (2022-03-17T20:47:42Z) - Invariance Learning in Deep Neural Networks with Differentiable Laplace
Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z) - Reinforced Data Sampling for Model Diversification [15.547681142342846]
This paper proposes a new Reinforced Data Sampling (RDS) method to learn how to sample data adequately.
We formulate the optimisation problem of model diversification $delta-div$ in data sampling to maximise learning potentials and optimum allocation by injecting model diversity.
Our results suggest that the trainable sampling for model diversification is useful for competition organisers, researchers, or even starters to pursue full potentials of various machine learning tasks.
arXiv Detail & Related papers (2020-06-12T11:46:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.